Abstract

A central problem in artificial intelligence is to choose actions to maximize reward
in a partially observable, uncertain environment. To do so, we must obtain
an accurate environment model, and then plan to maximize reward. However, for
complex domains, specifying a model by hand can be a time-consuming process.
This motivates an alternative approach: learning a model directly from observations.
Unfortunately, learning algorithms often recover a model that is too inaccurate to
support planning or too large and complex for planning to succeed; or, they require
excessive prior domain knowledge or fail to provide guarantees such as statistical
consistency. To address this gap, we propose spectral subspace identification
algorithms which provably learn compact, accurate, predictive models of partially
observable dynamical systems directly from sequences of action-observation pairs.
Our research agenda includes several variations of this general approach: spectral
methods for classical models like Kalman filters and hidden Markov models, batch
algorithms and online algorithms, and kernel-based algorithms for learning models
in high- and infinite-dimensional feature spaces. All of these approaches share
a common framework: the model’s belief space is represented as predictions of
observable quantities and spectral algorithms are applied to learn the model parameters.
Unlike the popular EM algorithm, spectral learning algorithms are statistically
consistent, computationally efficient, and easy to implement using established
matrix-algebra techniques. We evaluate our learning algorithms on a series of prediction
and planning tasks involving simulated data and real robotic systems.
(bottom). Here we are displaying image values that are clipped to stay within
the valid pixel range [0,255]. Images for LB-1 are not shown. The constraint
generation synthesized steam sequence is qualitatively better looking than the
steam sequence generated by LB-1, although there is little qualitative difference
between the two synthesized fountain sequences.


7.3 Bar graphs illustrating decreases in objective function value relative to the least
squares solution (A,B) and the running times (C,D) for different stable LDS
learning algorithms on the fountain and steam textures respectively, based
on the corresponding columns of Table 7.1.


7.4 (A): 60 days of data for 22 drug categories aggregated over all zipcodes in the
city. (B): 60 days of data for a single drug category (cough/cold) for all 29 zipcodes
in the city. (C): Sunspot numbers for 200 years separately for each of the
12 months. The training data (top), simulated output from constraint generation,
output from the unstable least squares model, and output from the over-damped
LB-1 model (bottom).
8.1 Learning and Planning in the Autonomous Robot Domain. (A) The robot uses
visual sensing to traverse a square domain with multi-colored walls and a central
obstacle. Examples of images recorded by the robot occupying two different
positions in the environment are shown at the bottom of the figure. (B) A
to-scale 3-dimensional view of the environment. (C) The 2nd and 3rd dimensions
of the learned subspace (the first dimension primarily contained normalization
information). Each point is the embedding of a single history, displayed with
color equal to the average RGB color in the first image in the highest probability
test. (D) The same points in (C) projected onto the environment’s geometric
space. (E) The value function computed for each embedded point; lighter indicates
higher value. (F) Policies executed in the learned subspace. The red, green,
magenta, and yellow paths correspond to the policy executed by a robot with
starting positions facing the red, green, magenta, and yellow walls respectively.
(G) The paths taken by the robot in geometric space while executing the policy.
Each of the paths corresponds to the path of the same color in (F). The darker
circles indicate the starting and ending positions, and the tick-mark indicates the
robot’s orientation. (H) Analysis of planning from 100 randomly sampled start
positions to the target image (facing blue wall). In the bar graph: the mean
number of actions taken by the optimistic solution found by A* search in configuration
space (left); the mean number taken by the policy found by Perseus in the
learned model (center); and the mean number taken by a random policy (right).
The line graph illustrates the cumulative density of the number of actions given
the optimal, learned, and random policies.


9.1 Experimental Results. Error bars indicate standard error. (A) Estimating the
value function with a small number of informative features. PSTD and PSTD2
both do well. (B) Estimating the value function with a small set of informative
features and a large set of random features. LARS-TD is designed for this
scenario and dramatically outperforms PSTD and LSTD; however, it does not
outperform PSTD2. (C) Estimating the value function with a large set of
semi-informative features. PSTD is able to determine a small set of compressed features
that retain the maximal amount of information about the value function,
outperforming LSTD by a very large margin. (D) Pricing a high-dimensional
derivative via policy iteration. The y-axis is expected reward for the current
policy at each iteration. The optimal threshold strategy (sell if price is above a
threshold [115]) is in black, LSTD (16 canonical features) is in blue, LSTD (on
the full 220 features) is cyan, LARS-TD (feature selection from set of 220) is in
green, and PSTD (16 dimensions, compressing 220 features (16 + 204)) is in red.

10.1 A general principle for state space discovery. We can think of state as a statistic
of history that is minimally sufficient to predict future observations. If the
bottleneck is a rank constraint, then we get a spectral method.
10.2 Spectral SLAM on simulated data. See Section 10.4.1 for details. A.) Randomly
generated landmarks (6 of them) and robot path through the environment (500
timesteps). An SVD of the squared distance matrix recovers a linear transform
of the landmark and robot positions. Given the coordinates of 4 landmarks, we
can recover the landmark and robot positions in their original coordinates; or,
since 500 > 9, we can recover positions up to an orthogonal transform with no
additional information. Despite noisy observations, the robot recovers the true
path and landmark positions with very high accuracy. B.) The convergence of the
observation model for the remaining two landmarks: mean Frobenius-norm
error vs. number of range readings received, averaged over 1000 randomly generated
pairs of robot paths and environments. Error bars indicate 95% confidence
intervals.

10.3 The autonomous lawn mower and spectral SLAM. A.) The robotic lawn mower
platform. B.) In the first experiment, the robot traveled 1.9km receiving 3,529
range measurements. This path minimizes the effect of heading error by balancing
the number of left turns with an equal number of right turns in the robot’s
odometry (a commonly used path pattern in lawn mowing applications). The
light blue path indicates the robot’s true path in the environment, light purple
indicates dead-reckoning path, and dark blue indicates the spectral SLAM localization
result. C.) In the second experiment, the robot traveled 1.3km receiving
1,816 range measurements. This path highlights the effect of heading error on
dead reckoning performance by turning in the same direction repeatedly. Again,
spectral SLAM is able to accurately recover the robot’s path.

10.4 Comparison of Range-Only SLAM Algorithms. The table shows Localization
RMSE. Spectral SLAM has localization accuracy comparable to batch optimization
on its own. The best results (boldface entries) are obtained by initializing
nonlinear batch optimization with the spectral SLAM solution. The graph compares
runtime of Gauss-Newton batch optimization with spectral SLAM. Empirically,
spectral SLAM is 3-4 orders of magnitude faster than batch optimization
on the autonomous lawnmower datasets.
Chapter 1
Introduction


Many problems in machine learning and statistics involve collecting high-dimensional multivariate
observations or sequences of observations, and then hypothesizing a compact model which
explains these observations. An appealing representation for such a model is a latent variable
model that relates a set of observed variables to an additional set of unobserved or hidden variables.
Examples of popular latent variable models include latent tree graphical models and dynamical
system models, both of which occupy a fundamental place in engineering, control theory,
economics as well as the physical, biological, and social sciences. Unfortunately, to discover
the right latent state representation and model parameters, we must solve difficult structural and
temporal credit assignment problems.

Prior to the work described in this thesis, research on learning latent variable models and dynamical
systems has predominantly relied on likelihood maximization and local search heuristics
such as expectation maximization (EM); these heuristics often lead to a search space with a host
of bad local optima, and may therefore require impractically many restarts to reach a prescribed
training precision.

This thesis will focus on two complementary ideas: predictive representations and spectral
learning algorithms. These models and algorithms hold the promise of overcoming the problems
inherent in previous approaches: unlike standard latent variable models, predictive representations
can be naturally written in terms of observable quantities; and unlike the EM algorithm,
spectral methods for learning predictive representations are computationally efficient, statistically
consistent, and have no local optima; in addition, they can be simple to implement, and
have state-of-the-art practical performance for many interesting learning problems.


1.1  Main  Contributions 

This thesis makes the following major contributions:

• We develop a novel spectral algorithm for learning constant-covariance Kalman filters.
While we are not the first to develop spectral approaches to learning Kalman filters, we
developed this method specifically to demonstrate how spectral algorithms for learning
Kalman filters are similar to spectral algorithms for learning nonlinear dynamical system
models. We also develop a novel method for enforcing the stability of learned linear dynamical
system models, an important consideration when learning models for practical
applications.

• We develop spectral learning algorithms for discrete-observation discrete-action Predictive
State Representations (PSRs), which are an expressive class of controlled dynamical system
models that includes well-known models like Hidden Markov Models and Partially
Observable Markov Decision Processes. One of the advantages of PSRs is that they are
very compact models that have the ability to represent very large discrete state spaces with
low-dimensional representations. We demonstrate the power and utility of our learning
algorithm by proving that it can learn Hidden Markov Models efficiently, by showing that
we can learn a compact model of a simulated robotic agent that supports planning, and by
improving least squares temporal difference learning.

• We extend spectral learning algorithms for discrete PSRs to use continuous features of
actions and observations. We show how to generalize Bayes’ rule to feature spaces, and
leverage this to create a very general dynamical system model and an efficient spectral
learning algorithm. This is an important advance, in that it makes learning complicated
non-linear models significantly easier and more data-efficient. We link these feature-based
spectral algorithms to non-parametric dynamical system models and spectral approaches
to learning models embedded in reproducing kernel Hilbert spaces.

• We show how spectral learning algorithms can be applied to problems beyond dynamical
system learning, such as range-only simultaneous localization and mapping and robot system
identification. The approaches developed may be adapted to future problems in robot
vision, mapping, and dynamical system learning.

The combination of predictive representations and spectral learning algorithms is an exciting new
area of research within machine learning. Since original publication, many of the novel ideas
and algorithms in this thesis have been embraced and extended by the machine learning research
community, and many of the core principles are starting to be applied to problems outside of dynamical
system learning. Much of the research includes extensions and theoretical contributions
to the algorithms and theory in this thesis [4, 7, 33, 68, 106]. Additionally, several researchers
have gone beyond learning models of dynamical systems to develop predictive representations
and associated spectral learning algorithms for learning graphical models [1, 74, 100, 101], latent
Dirichlet allocation [2], and probabilistic context free grammars [5, 25, 26].


1.2  Organization 

The  remainder  of  the  thesis  is  organized  as  follows: 


Part I: Spectral Learning Algorithms for Predictive Representations We develop a family
of spectral learning algorithms for learning predictive representations of dynamical systems.


Chapter 2 We introduce a spectral learning algorithm for a simple class of linear dynamical
systems with constant-covariance Gaussian noise. The algorithm proceeds by building covariance
matrices of observable quantities and then leverages linear algebra to learn the model parameters.
The goal of this chapter is to introduce the general spectral learning approach on a
simple well-known dynamical system.


Chapter 3 We generalize this approach to learning dynamical systems to a much more expressive
class of models called Predictive State Representations (PSRs) which include familiar
models like Hidden Markov Models and Partially Observable Markov Decision Processes as special
cases. Unlike Kalman filters, these models have non-linear dynamics and assume discrete
action, observation, and state spaces.


Chapter 4 We proceed to expand upon the work in Chapter 3 by developing a learning algorithm
based on continuous features of action and observation spaces. This allows us to learn
models even in the presence of complicated dynamical systems with very high-cardinality discrete
state, action, and observation spaces. The basic learning algorithms presented in Chapter 3
and Chapter 4 are the core contributions of this thesis.


Chapter 5 We use recent developments in Hilbert space embeddings of distributions to extend
our learning algorithms to cover dynamical systems with continuous action and observation
spaces. The learning algorithms in this chapter are qualitatively similar to those in previous chapters
but leverage infinite-dimensional feature spaces and use the kernel trick during learning.


Chapter 6 We provide efficient batch and online learning algorithms that use a number of
linear algebra tricks to learn from massive quantities of data.


Part  II:  Spectral  Learning  Algorithms  in  Practice  The  second  part  of  this  thesis  focuses 
on  extensions  to  the  basic  algorithms  described  in  Part  I  to  handle  practical  problems  in  system 
identification,  reinforcement  learning,  robotics,  and  related  fields. 


Chapter 7 We look at the problem of learning the parameters of stable Kalman filters with
small quantities of training data. Although we may know that the dynamics of the system are
stable a priori, this is not enforced in the spectral learning algorithm. We provide a way of
enforcing stability.


Chapter 8 We look at the problem of planning in a learned Predictive State Representation.
We demonstrate the advantages of this approach over previous methods and show that one can
indeed learn a model of an environment and plan in the learned model.

Chapter 9 We use theory from the first part of the thesis to link temporal difference learning
to spectral system identification. Using these insights we are able to build a novel
temporal-difference learning algorithm that outperforms previous methods on difficult policy learning
problems.

Chapter  10  We  use  insights  from  spectral  system  identification  to  develop  a  novel  spectral 
learning  approach  to  range-only  simultaneous  localization  and  mapping. 

Part  III:  Conclusions  We  summarize  the  main  contributions  of  this  thesis.  We  finally  conclude 
with  a  discussion  of  open  problems  and  directions  for  future  research. 

Part  IV:  Appendices  Contains  the  appendices  for  the  chapters  in  Parts  I  and  II. 

Part I

Spectral Learning Algorithms for
Predictive Representations

Chapter 2

A Spectral Learning Algorithm for
Constant-Covariance Kalman Filters


Many problems in machine learning involve sequences of real-valued multivariate observations.
To model the statistical properties of such data, it is often sensible to assume each observation to
be correlated with the value of an underlying latent variable, or state, that is evolving over the course
of the sequence. In the case where the state is real-valued and the noise terms are assumed to be
Gaussian, the resulting model is called a linear dynamical system (LDS). LDSs are an important
tool for modeling time series in engineering, controls and economics as well as the physical and
social sciences. In this chapter we define an LDS, describe some of the inference and learning
algorithms, and review the property of stability as it relates to the LDS transition model,
which will be relevant in Chapter 7. More details on LDSs and algorithms for inference and
learning LDSs (including spectral learning algorithms) can be found in several canonical papers
and standard references [35, 48, 51, 62, 117].


2.1  The  State  Space  Equations 


The  evolution  of  a  stochastic  linear  time-invariant  dynamical  system  (LDS)  can  be  described  by 
the  following  two  equations: 


$$x(h_{t+1}) = A x(h_t) + w_t, \qquad w_t \sim \mathcal{N}(0, Q) \tag{2.1a}$$

$$y_t = C x(h_t) + v_t, \qquad v_t \sim \mathcal{N}(0, R) \tag{2.1b}$$


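Time is indexed by the discrete¹ variable $t$. Here $h_t$ is a history of observations up to time $t$,
$x(h_t)$ denotes the hidden states in $\mathbb{R}^n$, $y_t$ the observations in $\mathbb{R}^m$, and the parameters of the
system are the dynamics matrix $A \in \mathbb{R}^{n \times n}$ and the observation model $C \in \mathbb{R}^{m \times n}$. The variables $w_t$
and $v_t$ describe zero-mean normally distributed process and observation noise respectively, with
covariance and cross-covariance matrices

$$\mathbb{E}\left[ \begin{bmatrix} w_t \\ v_t \end{bmatrix} \begin{bmatrix} w_s^\top & v_s^\top \end{bmatrix} \right] = \begin{bmatrix} Q & S \\ S^\top & R \end{bmatrix} \delta_{ts} \tag{2.2}$$
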


¹ In continuous-time dynamical systems, the derivatives are specified as functions of the current state. They can
be converted to discrete-time systems.

Figure 2.1: Graphical representation of the deterministic-stochastic linear dynamical system. See
text for details.

where $\delta_{ts}$ is the Kronecker delta, $Q \in \mathbb{R}^{n \times n}$ is non-negative definite, $R \in \mathbb{R}^{m \times m}$ is positive
definite, and $S \in \mathbb{R}^{n \times m}$ is the cross-covariance, which must satisfy $R - S Q S^\top > 0$. Inputs can
be incorporated into the LDS model via a simple modification of Equations 2.1. See [11] for
details.
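
The generative process in Equations 2.1 and 2.2 can be sketched in a few lines of NumPy. This is only an illustrative sketch: the function name `simulate_lds`, its argument order, and the choice to draw the process and observation noise jointly from the block covariance of Equation 2.2 are assumptions made for this example, not part of the original text.

```python
import numpy as np

def simulate_lds(A, C, Q, R, S, x0, T, seed=0):
    """Sample T steps from x_{t+1} = A x_t + w_t, y_t = C x_t + v_t.

    Hypothetical helper: the noise pair (w_t, v_t) is drawn jointly from a
    zero-mean Gaussian with covariance [[Q, S], [S^T, R]].
    """
    rng = np.random.default_rng(seed)
    n, m = A.shape[0], C.shape[0]
    joint_cov = np.block([[Q, S], [S.T, R]])
    X = np.zeros((n, T))  # column t holds the state at step t
    Y = np.zeros((m, T))  # column t holds the observation at step t
    x = np.asarray(x0, dtype=float)
    for t in range(T):
        noise = rng.multivariate_normal(np.zeros(n + m), joint_cov)
        w, v = noise[:n], noise[n:]
        X[:, t] = x
        Y[:, t] = C @ x + v   # observation equation (2.1b)
        x = A @ x + w         # state update equation (2.1a)
    return X, Y
```

Calling `simulate_lds` with a stable `A` (spectral radius below one) yields an observation sequence `Y` of the kind the learning algorithms in this chapter take as input.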


2.1.1  Inference 

In this section we describe the forward and backward inference algorithms for LDSs. More
details can be found in several sources [51, 62, 117].

Given a known model, the distribution over state at time $t$, $P[X_t \mid y_{1:T}]$, can be exactly computed
in two parts: a forward and a backward recursive pass. The forward pass, which is dependent
on the initial state $x(h_1)$ and the observations $y_{1:t}$, is known as the Kalman filter, and
the backward pass, which uses the observations from $y_T$ to $y_{t+1}$, is known as the
Rauch-Tung-Striebel (RTS) equations. The combined forward and backward passes are together called the
Kalman smoother. It is worth noting that the standard LDS filtering and smoothing inference
algorithms [48, 82] are instantiations of the junction tree algorithm for Bayesian Networks on
the dynamic Bayesian network described in Figure 2.1 (see, for example, Murphy [70]).


The  Forward  Pass  (Kalman  Filter) 

Let the mean and covariance of the belief state estimate $P[X_t \mid y_{1:t}]$ at time $t$ be denoted by $x(h_t)$
and $P(h_t)$ respectively. The estimates $x(h_t)$ and $P(h_t)$ can be predicted from the previous time step
and the previous observation by Equations 2.3a-f. First we estimate the next state and next state
covariance without correcting for an observation (the superscript $-$ marks a prediction that has
not yet been corrected):

$$x^-(h_{t+1}) = A x(h_t) \tag{2.3a}$$

$$P^-(h_{t+1}) = A P(h_t) A^\top + Q \tag{2.3b}$$

Equation 2.3a can be thought of as applying the dynamics matrix $A$ to the mean to form an initial
prediction of $x(h_{t+1})$. Similarly, Equation 2.3b can be interpreted as using the dynamics matrix $A$
and error covariance $Q$ to form an initial estimate of the belief covariance $P(h_{t+1})$. The estimates
are then adjusted:

$$x(h_{t+1}) = x^-(h_{t+1}) + K_t e_t \tag{2.3c}$$

$$P(h_{t+1}) = P^-(h_{t+1}) - K_t C P^-(h_{t+1}) \tag{2.3d}$$

where the error in prediction at the previous time step (the innovation) $e_t$ and the Kalman gain
matrix $K_t$ are computed as follows:

$$e_t = y_t - C x(h_t) \tag{2.3e}$$

$$K_t = (P(h_t) C^\top + S)(C P(h_t) C^\top + R)^{-1} \tag{2.3f}$$

The weighted error in Equation 2.3c corrects the predicted mean given an observation, and
Equation 2.3d reduces the variance of the belief by an amount proportional to the observation
covariance. Taken together, Equations 2.3a-f define a specific form of the Kalman filter known as the
forward innovation model.
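
To make the predict-then-correct structure concrete, the following is a minimal sketch of one filtering step in the common textbook form. It assumes zero cross-covariance ($S = 0$) and uses the newly arrived observation to correct the prediction, so its time indexing may differ from the convention of Equations 2.3a-f; the function name `kalman_step` is hypothetical.

```python
import numpy as np

def kalman_step(x, P, y, A, C, Q, R):
    """One predict-then-correct Kalman filter step (assumes S = 0)."""
    # Predict: push the mean and covariance through the dynamics.
    x_pred = A @ x
    P_pred = A @ P @ A.T + Q
    # Innovation: the part of the observation the prediction does not explain.
    e = y - C @ x_pred
    # Kalman gain, an n x m matrix.
    K = P_pred @ C.T @ np.linalg.inv(C @ P_pred @ C.T + R)
    # Correct the predicted mean and shrink the predicted covariance.
    x_new = x_pred + K @ e
    P_new = P_pred - K @ C @ P_pred
    return x_new, P_new
```

In the scalar case with $A = C = 1$, $Q = 0$, $R = 1$, prior mean 0 and prior variance 1, observing $y = 2$ gives a gain of 0.5, so the corrected mean moves halfway to the observation and the variance is halved.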

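2.2 The Constant-Covariance Kalman Filter

In this chapter we will be focusing on learning the parameters $A$, $C$, and $K$ of the
constant-covariance Kalman filter. As mentioned in the previous section, the Kalman filter is given by
the forward innovation model (Equations 2.3a-f). We make the additional assumption that the
covariance of this Kalman filter, $P(h_t)$, does not vary with time $t$, and we assume that the linear
dynamical system has a stationary covariance

$$\Sigma_{\mathcal{X},\mathcal{X}} = \mathbb{E}\left[x(h_t) x(h_t)^\top\right] \tag{2.4}$$

where $\Sigma_{\mathcal{X},\mathcal{X}}$ is independent of time $t$. Additionally, we define the following output covariances:

$$\begin{aligned}
\Lambda &= \mathbb{E}\left[y_t y_t^\top\right] \\
&= \mathbb{E}\left[(C x(h_t) + v_t)(C x(h_t) + v_t)^\top\right] \\
&= C\, \mathbb{E}\left[x(h_t) x(h_t)^\top\right] C^\top + \mathbb{E}\left[v_t v_t^\top\right] \\
&= C \Sigma_{\mathcal{X},\mathcal{X}} C^\top + R
\end{aligned} \tag{2.5}$$

and

$$\begin{aligned}
G &= \mathbb{E}\left[x(h_{t+1}) y_t^\top\right] \\
&= \mathbb{E}\left[(A x(h_t) + w_t)(C x(h_t) + v_t)^\top\right] \\
&= A\, \mathbb{E}\left[x(h_t) x(h_t)^\top\right] C^\top + \mathbb{E}\left[w_t v_t^\top\right] \\
&= A \Sigma_{\mathcal{X},\mathcal{X}} C^\top + S
\end{aligned} \tag{2.6}$$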

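The fact that the state covariance is constant, $P(h_t) = \Sigma_{\mathcal{X},\mathcal{X}}$, implies that the Kalman gain
[117] can be written

$$\begin{aligned}
K &= (G - A P(h_t) C^\top)(\Lambda - C P(h_t) C^\top)^{-1} \\
&= (G - A \Sigma_{\mathcal{X},\mathcal{X}} C^\top)(\Lambda - C \Sigma_{\mathcal{X},\mathcal{X}} C^\top)^{-1} \\
&= S R^{-1}
\end{aligned} \tag{2.7}$$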
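
The quantities in Equations 2.4 through 2.7 can be checked numerically for a known model. The sketch below assumes that $A$ is stable and that the stationary state covariance satisfies the discrete Lyapunov equation $\Sigma = A \Sigma A^\top + Q$, which follows from Equation 2.1a when the process noise is uncorrelated with the current state; the function name `stationary_covariances` and the Kronecker-product solve are choices made for this example.

```python
import numpy as np

def stationary_covariances(A, C, Q, R, S):
    """Return (Sigma, Lambda, G, K) for a stable LDS with known parameters."""
    n = A.shape[0]
    # Solve Sigma = A Sigma A^T + Q using row-major vectorization:
    # vec(A Sigma A^T) = kron(A, A) vec(Sigma).
    vec_sigma = np.linalg.solve(np.eye(n * n) - np.kron(A, A), Q.reshape(-1))
    Sigma = vec_sigma.reshape(n, n)
    Lam = C @ Sigma @ C.T + R   # output covariance, Equation 2.5
    G = A @ Sigma @ C.T + S     # next-state/output covariance, Equation 2.6
    # Constant Kalman gain, Equation 2.7; algebraically equal to S R^{-1}.
    K = (G - A @ Sigma @ C.T) @ np.linalg.inv(Lam - C @ Sigma @ C.T)
    return Sigma, Lam, G, K
```

For a scalar system with $A = 0.5$, $Q = 0.75$, $C = 2$, $R = 1$ and $S = 0.3$, this gives $\Sigma = 1$, $\Lambda = 5$, $G = 1.3$ and $K = 0.3$.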

Learning the Kalman filter from data involves finding the parameters $\theta = \{A, C, S, R\}$ that
explain the observed data. In principle, the maximum likelihood solution for these parameters can
be found through the iterative EM algorithm; but in practice the EM algorithm often fails due
to local optima [35]. An alternative approach is to use subspace system identification methods
to compute an asymptotically unbiased solution in closed form. In practice, subspace identification
is faster, more computationally robust, and easier to implement than maximum likelihood
approaches [35]; so that is what we consider here.

Subspace methods calculate the parameters of an LDS by using tools from linear algebra
including the oblique projection and the singular value decomposition (SVD) [39] to find Kalman
filter estimates of the underlying state sequence in closed form. We discuss a novel subspace
identification algorithm here. (See [117] for variations.)

2.2.1  Observable  Representations 

The key insight behind spectral system identification is that the parameters of an LDS can be written
solely in terms of observable quantities. In this section we will show how to write down an
observable representation of a Kalman filter, and in Section 2.3 we will design a spectral learning
algorithm that leverages this observable representation.

We begin by defining infinite “matrices” of observations. Let $Y_{0|N_P}$ be defined as:

$$Y_{0|N_P} = \begin{bmatrix} y_0 & y_1 & \cdots \\ y_1 & y_2 & \cdots \\ \vdots & \vdots & \\ y_{N_P} & y_{N_P+1} & \cdots \end{bmatrix}$$

We will use $H$ to denote a matrix of past observations or histories. We can think of this matrix
as consisting of columns of $N_P$-length histories $h_t$. Specifically, we define

$$H = Y_{0|N_P} = [h_1 \;\; h_2 \;\; \ldots]$$

Similarly, let $F$, $F^+$ denote matrices of “future” observations and the one-step shifted future
respectively. We can think of these matrices as consisting of columns of $N_F$-length future
sequences of observations $f_t$. These are defined as

$$F = Y_{N_P+1|N_P+N_F} = [f_1 \;\; f_2 \;\; \ldots], \qquad F^+ = Y_{N_P+2|N_P+N_F+1} = [f_2 \;\; f_3 \;\; \ldots]$$

Note that the matrices $H$, $F$ and $F^+$ all have the same form: in each matrix, each block of rows
is equal to the previous block but shifted by a constant number of columns. Such matrices are
called block Hankel matrices and play an important role in spectral system identification [62].
Finally, let

$$X = [x(h_1) \;\; x(h_2) \;\; \ldots] \in \mathbb{R}^{n \times \infty} \tag{2.9}$$

be a set of Kalman filter state estimates derived from the same set of observations. These state
estimates are, by definition, an expected set of features given a history that are sufficient to predict
future observations.
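
With a finite observation sequence, truncated versions of these block Hankel matrices can be assembled by stacking shifted copies of the data. The sketch below is one way to do this; the helper name `block_hankel`, its zero-based indexing, and the convention that observations are stored as columns of an m x T array are assumptions for illustration and may not match the exact index ranges used in the text.

```python
import numpy as np

def block_hankel(Y, first, num_blocks, num_cols):
    """Stack shifted copies of Y (shape m x T) into a block Hankel matrix.

    Block row i, column j of the result holds observation y_{first + i + j}.
    """
    m, T = Y.shape
    if first + num_blocks - 1 + num_cols > T:
        raise ValueError("not enough observations for the requested size")
    rows = [Y[:, first + i : first + i + num_cols] for i in range(num_blocks)]
    return np.vstack(rows)

# Toy scalar sequence y_t = t, so entries can be read off directly.
Y = np.arange(10.0).reshape(1, 10)
Np, Nf, N = 3, 2, 4
H = block_hankel(Y, 0, Np, N)            # histories
F = block_hankel(Y, Np, Nf, N)           # futures
F_plus = block_hankel(Y, Np + 1, Nf, N)  # one-step shifted futures
```

Dropping the first column of `F` gives the same entries as dropping the last column of `F_plus`, which is the shift structure that the derivations in this section rely on.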

Thus, if the observations arise from an LDS, then, by the forward innovation model
(Equation 2.1):

$$\mathbb{E}[F \mid X] = \begin{bmatrix}
C x(h_1) & C x(h_2) & \cdots \\
C A x(h_1) & C A x(h_2) & \cdots \\
C A^2 x(h_1) & C A^2 x(h_2) & \cdots \\
\vdots & \vdots & \\
C A^{N_F - 1} x(h_1) & C A^{N_F - 1} x(h_2) & \cdots
\end{bmatrix} \tag{2.10}$$

Next we define several matrices that will eventually let us learn the parameters of an LDS. Let $\Gamma$
(sometimes called the extended observability matrix) be defined:

$$\Gamma \overset{\text{def}}{=} \begin{bmatrix} C \\ C A \\ C A^2 \\ \vdots \\ C A^{N_F - 1} \end{bmatrix} \in \mathbb{R}^{m N_F \times n} \tag{2.11}$$

$\Gamma$ allows one to compute the expected future given state: e.g. $\mathbb{E}[f_1 \mid h_1] = \Gamma x(h_1)$. And, the
rows of $\Gamma$ allow us to compute the expectation of current and future observations: e.g.
$\mathbb{E}[y_{N_P+1} \mid h_1] = \Gamma_{1:m} x(h_1)$.
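
As a small illustration of Equation 2.11, the extended observability matrix can be built by repeatedly multiplying by the dynamics matrix. The function name `extended_observability` is hypothetical, and the sketch assumes `A` and `C` are NumPy arrays of shapes n x n and m x n.

```python
import numpy as np

def extended_observability(A, C, num_future):
    """Stack C, CA, ..., C A^(num_future - 1) into a single matrix."""
    blocks = []
    CAk = C.copy()  # starts at C A^0 = C
    for _ in range(num_future):
        blocks.append(CAk)
        CAk = CAk @ A  # advance to the next power of A
    return np.vstack(blocks)
```

Multiplying the result by a state vector gives the stacked expected observations over the next `num_future` steps, one block of m entries per step.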

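Next we define the covariance of states and $N_P$-length histories $\Sigma_{\mathcal{X},\mathcal{H}} \in \mathbb{R}^{n \times N_P}$:

$$\begin{aligned}
[\Sigma_{\mathcal{X},\mathcal{H}}]_{i,j} &= \mathbb{E}\left[X_{i,t} H_{j,t}\right] \\
&= \mathbb{E}\left[\mathbb{E}[X_{i,t} \mid h_t]\, \mathbb{E}[H_{j,t} \mid h_t]\right] \\
&= \mathbb{E}\left[x_i(h_t)\, \mathbb{E}[H_{j,t} \mid h_t]\right] \\
\implies \Sigma_{\mathcal{X},\mathcal{H}} &= \mathbb{E}\left[x(h_t)\, \mathbb{E}[H_t^\top \mid h_t]\right]
\end{aligned} \tag{2.12}$$

Although we cannot directly estimate this matrix from data, it plays a central role in our derivations
below. In particular, if we define a matrix

$$\begin{aligned}
[\Sigma_{\mathcal{H},\mathcal{H}}]_{i,j} &= \mathbb{E}\left[H_{i,t} H_{j,t}\right] \\
\implies \Sigma_{\mathcal{H},\mathcal{H}} &= \mathbb{E}\left[h_t h_t^\top\right]
\end{aligned} \tag{2.13}$$

it can be used in conjunction with $\Sigma_{\mathcal{X},\mathcal{H}}$ to find the Kalman filter state estimate at time $t$ by
orthogonal projection onto history [48]:

$$x(h_t) = \Sigma_{\mathcal{X},\mathcal{H}} \Sigma_{\mathcal{H},\mathcal{H}}^{-1} h_t \tag{2.14}$$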

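The Kalman filter state estimate is the optimal linear predictor of $x_t$ from history $h_t$, an observation
that Kalman made in his seminal paper [48]. The derivation of Equation 2.14 is somewhat
complicated, but the complete proof in the context of spectral system identification can be found
in [117]. Equation 2.14 also implies that the state covariance $\Sigma_{\mathcal{X},\mathcal{X}} = \mathbb{E}[x(h_t) x(h_t)^\top]$ can be
found from $\Sigma_{\mathcal{X},\mathcal{H}}$ and $\Sigma_{\mathcal{H},\mathcal{H}}$:

$$\begin{aligned}
\Sigma_{\mathcal{X},\mathcal{X}} &= \mathbb{E}\left[x(h_t) x(h_t)^\top\right] \\
&= \Sigma_{\mathcal{X},\mathcal{H}} \Sigma_{\mathcal{H},\mathcal{H}}^{-1}\, \mathbb{E}\left[h_t h_t^\top\right] \Sigma_{\mathcal{H},\mathcal{H}}^{-1} \Sigma_{\mathcal{X},\mathcal{H}}^\top \\
&= \Sigma_{\mathcal{X},\mathcal{H}} \Sigma_{\mathcal{H},\mathcal{H}}^{-1} \Sigma_{\mathcal{X},\mathcal{H}}^\top
\end{aligned} \tag{2.15}$$

This fact is used in our derivations below.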
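
The projection in Equation 2.14 and the covariance identity in Equation 2.15 can be illustrated on synthetic data. In the sketch below the array `x` is a stand-in for a quantity that would not be observable in practice, the sample sizes are arbitrary, and the helper name `project_onto_history` is hypothetical; the point is only that the covariance of the linearly projected estimate matches the closed form when both are computed from the same sample covariances.

```python
import numpy as np

def project_onto_history(x, h):
    """Linear projection of x (n x N) onto histories h (p x N).

    Returns the predictor W = Sigma_XH Sigma_HH^{-1}, the projected
    estimates W h, and the two sample covariance matrices.
    """
    num_samples = h.shape[1]
    Sigma_XH = x @ h.T / num_samples
    Sigma_HH = h @ h.T / num_samples
    W = Sigma_XH @ np.linalg.inv(Sigma_HH)
    return W, W @ h, Sigma_XH, Sigma_HH

rng = np.random.default_rng(0)
h = rng.standard_normal((6, 500))  # columns play the role of histories h_t
x = rng.standard_normal((2, 6)) @ h + 0.1 * rng.standard_normal((2, 500))

W, x_hat, Sigma_XH, Sigma_HH = project_onto_history(x, h)
# Sample covariance of the projected estimates (left side of Equation 2.15).
Sigma_XX = x_hat @ x_hat.T / h.shape[1]
# Closed form (right side of Equation 2.15).
Sigma_XX_closed = Sigma_XH @ np.linalg.inv(Sigma_HH) @ Sigma_XH.T
```

Here `Sigma_XX` and `Sigma_XX_closed` agree up to floating-point error, because the middle factor in the second line of Equation 2.15 cancels one of the inverses.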

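The two fundamental matrices that allow us to define observable versions of the parameters $A$
and $C$ for an LDS are $\Sigma_{F,H}$ and $\Sigma_{F^+,H}$. First we define $\Sigma_{F,H} \in \mathbb{R}^{N_F \times N_P}$, a covariance matrix of
past and future sequences of observations:

$$
\begin{aligned}
[\Sigma_{F,H}]_{ij} &= E[F_{i,t} H_{j,t}] \\
&= E\big[E[F_{i,t} H_{j,t} \mid h_t]\big] \\
&= E\big[E[F_{i,t} \mid h_t]\, E[H_{j,t} \mid h_t]\big] \\
&= E\big[\Gamma_i A\, x(h_t)\, E[H_{j,t} \mid h_t]\big] \\
&= \Gamma_i A\, E\big[x(h_t)\, E[H_{j,t} \mid h_t]\big] \\
&= \Gamma_i A\, [\Sigma_{X,H}]_j \\
\implies \Sigma_{F,H} &= \Gamma A \Sigma_{X,H}
\end{aligned} \tag{2.16}
$$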

Equation 2.16 tells us that the rank of $\Sigma_{F,H}$ is no more than $n$, since its factors $\Gamma$ and $\Sigma_{X,H}$ each
have rank no more than $n$. At this point we can define a sufficient set of histories and futures: it
is a set of histories and futures for which the rank of $\Sigma_{F,H}$ is equal to $n$.

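Next we define $\Sigma_{F^+,H} \in \mathbb{R}^{N_F \times N_P}$, a covariance matrix of past and one-step shifted futures:

$$
\begin{aligned}
[\Sigma_{F^+,H}]_{ij} &= E[F^+_{i,t} H_{j,t}] \\
&= E\big[E[F^+_{i,t} H_{j,t} \mid h_t]\big] \\
&= E\big[E[F^+_{i,t} \mid h_t]\, E[H_{j,t} \mid h_t]\big] \\
&= E\big[\Gamma_i A^2 x(h_t)\, E[H_{j,t} \mid h_t]\big] \\
&= \Gamma_i A^2 E\big[x(h_t)\, E[H_{j,t} \mid h_t]\big] \\
&= \Gamma_i A^2 [\Sigma_{X,H}]_j \\
\implies \Sigma_{F^+,H} &= \Gamma A^2 \Sigma_{X,H}
\end{aligned} \tag{2.17}
$$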

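Just like $\Sigma_{F,H}$, $\Sigma_{F^+,H}$ has rank at most $n$ due to $\Gamma$ and $\Sigma_{X,H}$.
Finally, we need one additional covariance matrix in order to estimate the Kalman gain:

$$
\begin{aligned}
[\Sigma_{F^+,Y}]_{ij} &= E[F^+_{i,t}\, y_{j,t}] \\
&= E\big[E[F^+_{i,t}\, y_{j,t} \mid h_{t+1}]\big] \\
&= E\big[E[F^+_{i,t} \mid h_{t+1}]\, E[y_{j,t} \mid h_{t+1}]\big] \\
&= E\big[\Gamma_i A\, x(h_{t+1})\, E[y_{j,t} \mid h_{t+1}]\big] \\
&= \Gamma_i A\, E\big[x(h_{t+1})\, E[y_{j,t} \mid h_{t+1}]\big] \\
&= \Gamma_i A\, [\Sigma_{X^+,Y}]_j \\
\implies \Sigma_{F^+,Y} &= \Gamma A \Sigma_{X^+,Y}
\end{aligned} \tag{2.18}
$$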


where $\Sigma_{X^+,Y} = E\big[x(h_{t+1})\, E[y_t^\top \mid h_{t+1}]\big]$. We are now in a position to define the parameters of
a Kalman filter in terms of observable covariance matrices. The key to this approach is that,
for linear dynamical systems, there are many possible equivalent representations for the same
system. That is, two LDSs are said to be equivalent if the second order statistics of the output
generated by the models are the same, i.e. the covariance sequence of the output is identical [117].²
This means that the Kalman filter is only defined up to a similarity transform: the predictions
made from any of these systems are the same. For example, given an invertible linear transform $S$, we can
define parameters $\hat{A} \stackrel{\text{def}}{=} SAS^{-1}$, $\hat{C} \stackrel{\text{def}}{=} CS^{-1}$, $\hat{K} \stackrel{\text{def}}{=} SK$, and $\hat{x}(h_t) \stackrel{\text{def}}{=} Sx(h_t)$. Predictions
from the transformed system are the same as those from the original system, e.g. $\hat{C}\hat{x}(h_t) = CS^{-1}Sx(h_t) =
Cx(h_t)$. Filtering and simulating from the filter are similarly equivalent. See Equations 2.20
and 2.21 below for more details.
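
The equivalence claim can be checked numerically. The sketch below is a minimal illustration under assumed values: the matrices `A`, `C`, `K`, the transform `S`, and the helper `filter_predictions` are hypothetical and randomly generated, not taken from any experiment in this chapter. It runs the filter update $x(h_{t+1}) = Ax(h_t) + K(y_t - Cx(h_t))$ in both coordinate systems and compares the one-step predictions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 3, 2

# Hypothetical filter parameters (illustrative values only).
A = 0.5 * rng.standard_normal((n, n))
C = rng.standard_normal((m, n))
K = 0.1 * rng.standard_normal((n, m))

# An arbitrary transform, assumed invertible.
S = rng.standard_normal((n, n)) + 3.0 * np.eye(n)
S_inv = np.linalg.inv(S)

# Transformed parameters: S A S^-1, C S^-1, S K.
A_hat, C_hat, K_hat = S @ A @ S_inv, C @ S_inv, S @ K

def filter_predictions(A, C, K, x0, ys):
    """Run the filter update and return the one-step predictions C x(h_t)."""
    x, preds = x0, []
    for y in ys:
        y_pred = C @ x
        preds.append(y_pred)
        x = A @ x + K @ (y - y_pred)
    return np.array(preds)

ys = rng.standard_normal((20, m))     # an arbitrary observation sequence
x0 = rng.standard_normal(n)

preds = filter_predictions(A, C, K, x0, ys)
preds_hat = filter_predictions(A_hat, C_hat, K_hat, S @ x0, ys)
```

Both runs produce the same predictions up to floating-point error, even though the internal states differ by the factor `S`.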

Practically, the equivalence of transformed LDSs means that we can pick an additional matrix
$U \in \mathbb{R}^{N_F \times n}$ such that $U^\top\Gamma$ is invertible and we can work with parameters transformed by
$U^\top\Gamma A$.³ A natural choice for $U$ is the leading left singular vectors of $\Sigma_{F,H}$, although a randomly
generated $U$ will work with probability 1 (but will typically result in slower learning). We now
define the parameters $\hat{A}$ and $\hat{C}$ of an LDS in terms of the observable matrices $\Sigma_{F,H}$, $\Sigma_{F^+,H}$,
and $U$, and simplify the definitions using Equations 2.19, to show that our parameters are only a
similarity transform away from the original LDS parameters:

² The definition of equivalence is that the entire PDF of observation sequences is the same. But it turns out that
all we need to check is second order statistics, since the joint distribution is Gaussian.

³ We assume without loss of generality that $A$ is full rank; therefore, the transform $U^\top\Gamma A$ is invertible since both
$U^\top\Gamma$ and $A$ are assumed to be invertible.


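$$
\begin{aligned}
\hat{x}(h_t) &= U^\top\Sigma_{F,H}\Sigma_{H,H}^{-1} h_t \\
&= (U^\top\Gamma A)\Sigma_{X,H}\Sigma_{H,H}^{-1} h_t \\
&= (U^\top\Gamma A)\, x(h_t)
\end{aligned} \tag{2.19a}
$$

$$
\begin{aligned}
\hat{A} &\stackrel{\text{def}}{=} U^\top\Sigma_{F^+,H}(U^\top\Sigma_{F,H})^\dagger \\
&= U^\top\Gamma A^2\Sigma_{X,H}(U^\top\Sigma_{F,H})^\dagger \\
&= (U^\top\Gamma A)A(U^\top\Gamma A)^{-1}(U^\top\Gamma A)\Sigma_{X,H}(U^\top\Sigma_{F,H})^\dagger \\
&= (U^\top\Gamma A)A(U^\top\Gamma A)^{-1}
\end{aligned} \tag{2.19b}
$$

$$
\begin{aligned}
\hat{C} &= U_{1:m}\hat{A}^{-1} \\
&= U_{1:m}\big((U^\top\Gamma A)A(U^\top\Gamma A)^{-1}\big)^{-1} \\
&= U_{1:m}(U^\top\Gamma A)A^{-1}(U^\top\Gamma A)^{-1} \\
&= \Gamma_{1:m}AA^{-1}(U^\top\Gamma A)^{-1} \\
&= \Gamma_{1:m}(U^\top\Gamma A)^{-1} \\
&= C(U^\top\Gamma A)^{-1}
\end{aligned} \tag{2.19c}
$$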

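Here we used the fact that the first $m$ rows of $\Gamma$ are $C$, from the definition of $\Gamma$ (Equation 2.11).
The step from $U_{1:m}(U^\top\Gamma A)$ to $\Gamma_{1:m}A$ additionally relies on $UU^\top\Gamma = \Gamma$, which holds when the
columns of $U$ form an orthonormal basis for the column space of $\Gamma$ (as is the case for the leading
left singular vectors of $\Sigma_{F,H}$ when that matrix has rank $n$).
In order to filter the system or to simulate from the system we additionally need to find the
observation covariance matrix $R$ and the Kalman gain $K$. The Kalman gain can be found given
$R$ and a similarity transform of $S$ by a modification of Equation 2.7: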

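$$
\begin{aligned}
\hat{R} &= \Lambda - \hat{C}U^\top\Sigma_{F,H}\Sigma_{H,H}^{-1}\Sigma_{F,H}^\top U\hat{C}^\top \\
&= \Lambda - \hat{C}(U^\top\Gamma A)\Sigma_{X,H}\Sigma_{H,H}^{-1}\Sigma_{X,H}^\top(U^\top\Gamma A)^\top\hat{C}^\top \\
&= \Lambda - C(U^\top\Gamma A)^{-1}(U^\top\Gamma A)\Sigma_{X,H}\Sigma_{H,H}^{-1}\Sigma_{X,H}^\top(U^\top\Gamma A)^\top(U^\top\Gamma A)^{-\top}C^\top \\
&= \Lambda - C\Sigma_{X,H}\Sigma_{H,H}^{-1}\Sigma_{X,H}^\top C^\top \\
&= \Lambda - C\Sigma_{X,X}C^\top
\end{aligned} \tag{2.19d}
$$

$$
\begin{aligned}
\hat{S} &= U^\top\Sigma_{F^+,Y} - U^\top\Sigma_{F^+,H}\Sigma_{H,H}^{-1}\Sigma_{F,H}^\top U\hat{C}^\top \\
&= (U^\top\Gamma A)\Sigma_{X^+,Y} - (U^\top\Gamma A)A\Sigma_{X,H}\Sigma_{H,H}^{-1}\Sigma_{F,H}^\top U\hat{C}^\top \\
&= (U^\top\Gamma A)\Sigma_{X^+,Y} - (U^\top\Gamma A)A\Sigma_{X,H}\Sigma_{H,H}^{-1}\Sigma_{X,H}^\top C^\top \\
&= (U^\top\Gamma A)\Sigma_{X^+,Y} - (U^\top\Gamma A)A\Sigma_{X,X}C^\top \\
&= (U^\top\Gamma A)(\Sigma_{X^+,Y} - A\Sigma_{X,X}C^\top) \\
&= (U^\top\Gamma A)S
\end{aligned} \tag{2.19e}
$$

$$
\begin{aligned}
\hat{K} &= \hat{S}\hat{R}^{-1} \\
&= (U^\top\Gamma A)SR^{-1} \\
&= (U^\top\Gamma A)K
\end{aligned} \tag{2.19f}
$$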
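We have now defined a Kalman filter in terms of only observable quantities. Given
Equations 2.19(a-f), the state update equations for our transformed Kalman filter are

$$
\begin{aligned}
\hat{y}_t &= \hat{C}\hat{x}(h_t) \\
&= C(U^\top\Gamma A)^{-1}(U^\top\Gamma A)x(h_t) \\
&= Cx(h_t)
\end{aligned} \tag{2.20}
$$

$$
\begin{aligned}
\hat{x}(h_{t+1}) &= \hat{A}\hat{x}(h_t) + \hat{K}(y_t - \hat{C}\hat{x}(h_t)) \\
&= (U^\top\Gamma A)A(U^\top\Gamma A)^{-1}(U^\top\Gamma A)x(h_t) + (U^\top\Gamma A)K\big(y_t - C(U^\top\Gamma A)^{-1}(U^\top\Gamma A)x(h_t)\big) \\
&= (U^\top\Gamma A)Ax(h_t) + (U^\top\Gamma A)K(y_t - Cx(h_t)) \\
&= (U^\top\Gamma A)\big(Ax(h_t) + K(y_t - Cx(h_t))\big)
\end{aligned} \tag{2.21}
$$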

The  fact  that  our  Kalman  filter  is  defined  in  terms  of  observable  quantities  means  that  developing 
a  learning  algorithm  is  straightforward. 


2.3  A  Spectral  Learning  Algorithm 

Our learning algorithm works by building empirical estimates $\hat\Sigma_{F,H}$, $\hat\Sigma_{F^+,H}$, $\hat\Sigma_{H,H}$, $\hat\Sigma_{F^+,Y}$, and
$\hat\Lambda$ of the matrices $\Sigma_{F,H}$, $\Sigma_{F^+,H}$, $\Sigma_{H,H}$, $\Sigma_{F^+,Y}$, and $\Lambda$ defined above. To build these estimates,
we repeatedly sample a history $h_t$ from the distribution $\omega$ and record the resulting observations.
This data gathering strategy implies that we must be able to arrange for the system to be in a state
corresponding to $h_t \sim \omega$.

In practice, reset is often not available. In this case we can estimate $\hat\Sigma_{F,H}$, $\hat\Sigma_{F^+,H}$, $\hat\Sigma_{H,H}$, $\hat\Sigma_{F^+,Y}$,
and $\hat\Lambda$ by dividing a single long sequence of action-observation pairs into subsequences and
pretending that each subsequence started with a reset. We are forced to use an initial distribution
over histories, $\omega$, equal to the steady state distribution of the system.

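Once we have computed $\hat\Sigma_{F,H}$ we can generate $\hat{U}$ by singular value decomposition of $\hat\Sigma_{F,H}$.
We can then learn the LDS parameters by plugging $\hat{U}$, $\hat\Sigma_{F,H}$, $\hat\Sigma_{F^+,H}$, $\hat\Sigma_{H,H}$, $\hat\Sigma_{F^+,Y}$, and $\hat\Lambda$ into
Equations 2.19(a-f).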
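
The following sketch illustrates the dynamics and observation steps (Equations 2.19b and 2.19c) under simplifying assumptions. Instead of estimating covariances from data, it builds the population matrices $\Sigma_{F,H} = \Gamma A \Sigma_{X,H}$ and $\Sigma_{F^+,H} = \Gamma A^2 \Sigma_{X,H}$ from a hypothetical LDS, with a random full-row-rank stand-in for $\Sigma_{X,H}$. All numerical values and variable names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical ground-truth LDS (A is full rank, (A, C) is observable).
A = np.array([[0.9, 0.1],
              [0.0, 0.8]])
C = np.array([[1.0, 0.5]])
n, m, n_future, n_past = 2, 1, 3, 4

# Extended observability matrix (Equation 2.11).
Gamma = np.vstack([C @ np.linalg.matrix_power(A, k) for k in range(n_future)])

# Stand-in for the (unobservable) state/history covariance, full row rank.
Sigma_XH = rng.standard_normal((n, m * n_past))

# Population versions of the observable matrices (Equations 2.16 and 2.17).
Sigma_FH = Gamma @ A @ Sigma_XH
Sigma_FpH = Gamma @ A @ A @ Sigma_XH

# U: the n leading left singular vectors of Sigma_FH.
U = np.linalg.svd(Sigma_FH)[0][:, :n]

# Equation 2.19b: dynamics matrix up to a similarity transform.
A_hat = U.T @ Sigma_FpH @ np.linalg.pinv(U.T @ Sigma_FH)

# Equation 2.19c: first m rows of U times the inverse of A_hat.
C_hat = U[:m, :] @ np.linalg.inv(A_hat)
```

Because `A_hat` is similar to `A`, the two matrices share eigenvalues. With finite data the empirical covariances replace the population ones and the recovered parameters are only approximate.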

As we include more data in our averages, the law of large numbers guarantees that our
estimates $\hat\Sigma_{F,H}$, $\hat\Sigma_{F^+,H}$, $\hat\Sigma_{H,H}$, $\hat\Sigma_{F^+,Y}$, and $\hat\Lambda$ converge to the true matrices $\Sigma_{F,H}$, $\Sigma_{F^+,H}$, $\Sigma_{H,H}$, $\Sigma_{F^+,Y}$,
and $\Lambda$. So by continuity of the formulas above, if our system is truly an LDS of finite rank, our
estimates $\hat{A}$, $\hat{C}$, $\hat{R}$, and $\hat{K}$ converge to the true parameters up to a linear transform; that is, our
learning algorithm is consistent.⁴

Below we discuss design choices and extensions to the above learning algorithm that are
often important for learning dynamical system models in practice.

⁴ The pseudoinverses are continuous at the true parameters, since the matrices to be pseudoinverted have full
rank. The matrix of $n$ left singular vectors $U$ may not be a continuous function of $\Sigma_{F,H}$ (in case of repeated
singular values); to deal with this possibility, we can either fix $U$ (say, as the left singular vectors of our estimated
$\hat\Sigma_{F,H}$ after some fixed amount of data), or we can make a slightly more complex argument based on the fact that the
column span of $U$ is a continuous function of $\hat\Sigma_{F,H}$ near $\Sigma_{F,H}$ (since the $n$th singular value of $\Sigma_{F,H}$ is nonzero,
and is therefore separated from the $(n+1)$st, which is zero).
Figure 2.2: Sunspot data, sampled monthly for 200 years. Each curve is a month; the x-axis is
over years. Below the graph are the first two principal components of O, where $Y_f$ and $Y_p$ each
consist of 1-observation Hankel matrices and 12-observation Hankel matrices. The 1-observation
Hankel matrices do not contain enough observations to recover a state which accurately reflects
the temporal patterns in the data, while the 12-observation Hankel matrices do.

2.3.1  Learning  from  Sequences  of  Observations 

Having multiple observations per column in F is particularly important when the underlying
dynamical system is not completely observable. For example, Figure 2.2 shows 200 years of
sunspot numbers, with each month modeled as a separate variable. Sunspots are known to have
two periods, the longer of which is 11 years. When spectral learning is performed using Hankel
matrices with $i = 12$, the first two principal components resemble the sine and cosine
bases, and the corresponding state variables are the coefficients needed to combine these bases so
as to predict 12 years of data. This is in contrast to the bases obtained by SVD on a 1-observation
$Y_f$ and $Y_p$, which reconstruct just the variation within a single year. Thus, with $i = 1$ the state
estimate will not converge to the true Kalman filter state estimate even if $j \to \infty$.
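
The effect of the number of observations per column can be seen on a toy signal. The sketch below is an illustration only: it uses a synthetic noise-free sinusoid with period 11 rather than the sunspot data, and the helper names `block_hankel` and `past_future_singular_values` are hypothetical. Here `i` is the number of stacked observations per column and `j` the number of columns.

```python
import numpy as np

def block_hankel(y, start, i, j):
    """Column k holds the i stacked observations y[start + k : start + k + i]."""
    cols = [y[start + k : start + k + i].reshape(-1) for k in range(j)]
    return np.column_stack(cols)

def past_future_singular_values(y, i):
    """Singular values of the empirical past/future covariance for window size i."""
    j = len(y) - 2 * i + 1
    Y_p = block_hankel(y, 0, i, j)      # past windows
    Y_f = block_hankel(y, i, i, j)      # future windows, shifted by i steps
    return np.linalg.svd(Y_f @ Y_p.T / j, compute_uv=False)

# Synthetic scalar signal with period 11 (a stand-in for real data).
y = np.sin(2 * np.pi * np.arange(200) / 11.0)[:, None]

sv_12 = past_future_singular_values(y, i=12)   # 12 singular values
sv_1 = past_future_singular_values(y, i=1)     # a single singular value
```

With `i = 12` the covariance has two significant singular values, matching the two-dimensional (sine and cosine) state of the sinusoid. With `i = 1` the matrix is $1 \times 1$, so at most one state dimension can be recovered.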


2.3.2  Stability 

Stability is a property of dynamical systems defined in terms of equilibrium points. If all
solutions of a dynamical system that start out near an equilibrium state $x_e$ stay near or converge to
$x_e$, then the state $x_e$ is stable or asymptotically stable respectively (see Figure 2.3). A linear
system $x(h_{t+1}) = Ax(h_t)$ is internally stable if the matrix $A$ obeys the Lyapunov stability criterion
(see below). The standard algorithms for learning linear Gaussian systems do not enforce
stability; when learning from finite data samples, the spectral learning solution may be unstable even
if the true system is stable, due to the sampling constraints, modeling errors, and measurement
noise. A dynamics matrix $A$ is said to be asymptotically stable in the sense of Lyapunov if, for
any given positive semi-definite symmetric matrix $Q$, there exists a positive-definite symmetric
matrix $P$ that satisfies the following Lyapunov criterion:

$$
P - APA^\top = Q \tag{2.22}
$$

Figure 2.3: System equilibria. (A) Unstable equilibrium. The state vector will rapidly move
away from the equilibrium point when perturbed. (B) Asymptotically stable equilibrium. The
state vector will return to the original equilibrium point when perturbed. (C) Stable equilibrium.
No "resistance;" a perturbed state vector will oscillate forever around the equilibrium point. Note
that the notion of asymptotic stability is stronger than stability.

An LDS is said to be asymptotically stable in the sense of Lyapunov if its dynamics matrix $A$ is.
Thus, the Lyapunov criterion can be interpreted as holding for an LDS if, for a given covariance
matrix, there exists a belief distribution where the predicted belief over state is equivalent to the
previous belief over state, that is, if there exists an equilibrium point. Below we relate stability
in the sense of Lyapunov to stability and asymptotic stability.

It is interesting to note that the Lyapunov criterion holds iff the spectral radius $\rho(A) \le 1$.
Recall that a matrix $M$ is positive semi-definite iff $zMz^\top \ge 0$ for all non-zero vectors $z$. Let $\lambda$
be an eigenvalue of $A$ and $v$ be a corresponding left eigenvector, giving us $vA = \lambda v$; then

$$
vQv^\top = v(P - APA^\top)v^\top = vPv^\top - \lambda\, vPv^\top \bar{\lambda} = vPv^\top(1 - |\lambda|^2) \tag{2.23}
$$

(for complex $\lambda$, read $v^\top$ as the conjugate transpose). Since $vPv^\top > 0$, it follows that $|\lambda| \le 1$ is
equivalent to $vQv^\top \ge 0$, and therefore to Equation 2.22. When $\rho(A) < 1$, the system is
asymptotically stable. To see this, suppose $D\Lambda D^{-1}$ is the
eigen-decomposition of $A$, where $\Lambda$ has the eigenvalues of $A$ along the diagonal and $D$ contains
the eigenvectors. Then,

$$
\lim_{k\to\infty} A^k = \lim_{k\to\infty} D\Lambda^k D^{-1} = D\Big(\lim_{k\to\infty}\Lambda^k\Big)D^{-1} = 0 \tag{2.24}
$$

since it is clear that $\lim_{k\to\infty}\Lambda^k = 0$. If $\Lambda = I$, then $A$ is stable but not asymptotically stable
and the state oscillates around $x_e$ indefinitely. However, this is true only under the assumption
of zero noise. In the case of an LDS with Gaussian noise, a dynamics matrix with unit spectral
radius would cause the state estimate to move steadily away from $x_e$. Hence such an LDS is
asymptotically stable only when $\rho(A)$ is strictly less than 1.
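
These conditions are easy to check numerically. The sketch below is an illustration, not the stability-enforcing method of Chapter 7. It computes the spectral radius of a hypothetical dynamics matrix and solves $P - APA^\top = Q$ through the row-major vectorization identity $\mathrm{vec}(APA^\top) = (A \otimes A)\,\mathrm{vec}(P)$. The function names and the example matrix are assumptions.

```python
import numpy as np

def spectral_radius(A):
    """Largest eigenvalue magnitude of A."""
    return np.max(np.abs(np.linalg.eigvals(A)))

def solve_discrete_lyapunov(A, Q):
    """Solve P - A P A^T = Q using vec(P) = (I - A kron A)^{-1} vec(Q).

    NumPy's default (row-major) reshape matches the identity
    vec(A P A^T) = (A kron A) vec(P).
    """
    n = A.shape[0]
    lhs = np.eye(n * n) - np.kron(A, A)
    return np.linalg.solve(lhs, Q.reshape(-1)).reshape(n, n)

# Hypothetical stable dynamics matrix with eigenvalues 0.9 and 0.8.
A = np.array([[0.9, 0.1],
              [0.0, 0.8]])
Q = np.eye(2)

rho = spectral_radius(A)
P = solve_discrete_lyapunov(A, Q)

# For rho < 1, powers of A decay to zero (Equation 2.24).
A_power = np.linalg.matrix_power(A, 500)
```

For this example the solution `P` is symmetric positive definite, as the criterion requires. The dense Kronecker solve costs $O(n^6)$ time, so it is only suitable for small state dimensions.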

Stability is a desirable characteristic for linear dynamical systems, but it is often ignored by
algorithms that learn these systems from data. When the amount of training data is small, the
learned system is often unstable ($\rho(A) > 1$). In Chapter 7 and [88] we propose a novel method
for learning guaranteed stable linear dynamical systems: we formulate an approximation of the
problem as a convex program, start with a solution to a relaxed version of the program, and
incrementally add constraints to improve stability. We show that this approach can drastically
increase the performance and utility of learned Kalman filters in practice.


2.4  Conclusions 

In  this  chapter  we  presented  a  spectral  learning  algorithm  for  constant-covariance  Kalman  filters. 
This  algorithm  is  novel,  though  quite  similar  to  previous  subspace  identification  algorithms  for 
linear  dynamical  systems  [51,  117].  Additional  experimental  results  using  this  algorithm  (in  the 
context  of  learning  stable  Kalman  filters)  can  be  found  in  Chapter  7. 

Several ideas introduced in this chapter feature prominently in our contributions to
learning the more general algorithms discussed in the next several chapters. In particular, we
exploit  the  fact  that  dynamical  systems  can  often  be  written  in  terms  of  observable  covariance 
matrices  and  states  can  be  viewed  as  predictive  statistics.  When  this  is  the  case,  the  state  space 
of  the  dynamical  system  can  be  found  by  a  spectral  decomposition  of  the  covariance  of  past 
and  future  observations.  Finally,  the  state  update  can  be  found  via  application  of  a  (potentially 
generalized)  version  of  Bayes’  rule. 
Chapter 3

Spectral Learning Algorithms for Predictive State Representations

3.1  Introduction 

In the previous chapter we looked at the problem of learning a Kalman filter directly from data.
In this case we assumed a latent variable model with linear transition and observation functions
and  Gaussian  noise.  For  many  systems,  these  assumptions  are  overly  restrictive.  In  this  chapter 
we  will  consider  learning  models  of  dynamical  systems  where  the  transitions  are  nonlinear,  the 
states  are  discrete,  and  the  noise  model  is  assumed  to  be  multinomial.  With  a  large  number  of 
states,  these  dynamical  systems  can  be  very  expressive. 

Two well-known multinomial dynamical system models are Hidden Markov Models (HMMs) [79]
and Partially Observable Markov Decision Processes (POMDPs) [22, 97]. Here we will focus on
a general multinomial-distributed dynamical system model called a Predictive State
Representation (PSR) [61], which includes HMMs and POMDPs as special cases. PSRs and the closely
related Observable Operator Models (OOMs) [44] are compact and complete descriptions of
a dynamical system. PSRs can be viewed as generalizations of POMDPs that have attracted
interest because they both have greater representational capacity than POMDPs and yield
representations that are at least as compact [31, 93]. In contrast to the latent-variable representations
of POMDPs, PSRs and OOMs represent the state of a dynamical system by tracking
occurrence probabilities of a set of future events (called tests or characteristic events) conditioned
on past events (called histories or indicative events). Because tests and histories are observable
quantities, it has been suggested that learning PSRs and OOMs should be easier than learning
POMDPs.

In this chapter we develop a spectral learning algorithm for PSRs (and thus POMDPs) that
shares the advantages of spectral learning algorithms for linear dynamical systems: the
algorithm is statistically consistent and easy to implement with simple linear algebra operations. The
chapter is organized as follows. In Section 3.2 we will present the PSR model for dynamical
systems. Then in Section 3.3 we will describe the observable representation of PSRs. Finally, in
Section 3.4 we will leverage the observable representation to develop a simple spectral learning
algorithm for PSRs.
a set of tests                                                  $\mathcal{T}$
a single test in $\mathcal{T}$                                  $\tau_i$
the observations in test $\tau_i$                               $\tau_i^O$
the actions in test $\tau_i$                                    $\tau_i^A$
the prediction of a test $\tau_i$ at time $t$                   $P[\tau^O_{i,t} \mid do(\tau^A_{i,t}), h_t]$
the prediction vector of all tests $\tau$ at time $t$           $P[\tau^O_t \mid do(\tau^A_t), h_t]$
shorthand for the prediction of test $\tau_i$ at time $t$       $\tau_i(h_t)$
shorthand for the prediction of all test outcomes at time $t$   $\tau(h_t)$

Table 3.1: Notation for Tests.

3.2  Predictive  State  Representations 

3.2.1  Tests  and  PSR  Notation 

PSRs represent state as a set of predictions of observable experiments or tests that one could
perform in the system. The notation for tests is summarized in Table 3.1. Specifically, a test of
length $N_F$ is an ordered sequence of future action-observation pairs $\tau_i = a_1, o_1, \ldots, a_{N_F}, o_{N_F}$
that can be executed and observed at any time $t$. Likewise, a history is an ordered sequence of
actions and observations $h_t = a_1, o_1, \ldots, a_{t-1}, o_{t-1}$ that has been executed and observed prior to
a given time $t$. We will often work with sets of tests; therefore, let $\mathcal{T} = \{\tau_i\}$ be a set of $d$ tests.
A test $\tau_i$ is said to be executed at time $t$ if we intervene to take the sequence of actions specified
by the test, $\tau_i^A = a_1, \ldots, a_{N_F}$. It is said to succeed at time $t$ if it is taken and if the sequence
of observations in the test, $\tau_i^O = o_1, \ldots, o_{N_F}$, matches the observations generated by the system.
The prediction for test $i$ at time $t$ is the probability of the test succeeding given a history $h_t$ and
given that we take it:

$$
P[\tau^O_{i,t} \mid do(\tau^A_{i,t}), h_t] = P[o_1 \mid a_1, h_t] \prod_{i=2}^{N_F} P[o_i \mid a_{1:i}, h_t] \tag{3.1}
$$

We write $\tau(h_t)$ for the prediction vector of success probabilities for the tests $\tau_i \in \mathcal{T}$ given a
history $h_t$, with elements:

$$
\tau_i(h_t) = P[\tau^O_{i,t} \mid do(\tau^A_{i,t}), h_t]
$$

The key idea behind a PSR is that if we know the expected outcomes of executing all possible
tests, then we also know everything there is to know about the state of a dynamical system [93].

Knowing the success probabilities of some tests may allow us to compute the success
probabilities of other tests. That is, given a test $\tau_i$ and a prediction vector $\tau(h_t)$, there may exist a
prediction function $f_{\tau_i}$ such that $P[\tau_i^O \mid do(\tau_i^A), h_t] = f_{\tau_i}(\tau(h_t))$. In this case, we say $\tau(h_t)$ is
a sufficient statistic for $P[\tau_i^O \mid do(\tau_i^A), h_t]$.

3.2.2  Linear  Predictive  State  Representations 

Formally, a PSR is a tuple $\langle \mathcal{O}, \mathcal{A}, \mathcal{Q}, \mathcal{F}, x_1 \rangle$. $\mathcal{O}$ is the set of possible observations and $\mathcal{A}$ is the
set of possible actions. $\mathcal{Q}$ is a core set of tests $x_i \in \mathcal{Q}$, i.e., a set whose prediction vector $x(h_t)$ is
a sufficient statistic for the prediction of all tests $\tau_i$ at time $t$. $\mathcal{F}$ is the set of prediction functions
$f_{\tau_i}$ for all tests $\tau_i$ (which must exist since $\mathcal{Q}$ is a core set), and $x_1 = x(h_1)$ is the initial prediction
vector after seeing the empty history $h_1$. In this work we will restrict ourselves to linear PSRs,
in which all prediction functions are linear: $f_{\tau_i}(x(h_t)) = g_{\tau_i}^\top x(h_t)$ for some vector $g_{\tau_i} \in \mathbb{R}^{|\mathcal{Q}|}$.

Since  x(ht)  is  a  sufficient  statistic  for  all  tests,  it  is  a  state  for  our  PSR:  i.e.,  for  each  time 
t  we  can  remember  just  x(ht)  instead  of  the  history  ht  itself.  After  taking  action  a  and  seeing 
observation  o,  we  can  update  x(ht)  recursively  by  extending  to  a  joint  probability  distribution 
of  the  current  observation  and  the  next  state  and  then  conditioning  on  the  probability  of  the 
observation. 

To see how this works, consider the following example. In a linear PSR we can predict the
success of any test $\tau_i$ at time $t+1$ extended at the beginning by an action $a$ and an observation
$o$, which we call $ao\tau_i$, as a linear function of our core test predictions $x(h_t)$:

$$
P[\tau^O_{i,t+1}, o_t = o \mid do(\tau^A_{i,t+1}, a_t = a), h_t] = g_{ao\tau_i}^\top x(h_t) \tag{3.2}
$$

If we write $M_{ao}$ for the matrix with rows $g_{aox_i}^\top$ for each test $x_i \in \mathcal{Q}$, then we can write the joint
probability of the next state and current observation as:

$$
P[x^O_{t+1}, o_t = o \mid do(x^A_{t+1}, a_t = a), h_t] = M_{ao}x(h_t) \tag{3.3}
$$

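Additionally, given a normalization vector $x_\infty \in \mathbb{R}^d$, defined by

$$
\begin{aligned}
x_\infty^\top P[x^O_t \mid do(x^A_t), h_t] &= 1 \quad (\forall t) \\
\implies x_\infty^\top x(h_t) &= 1 \quad (\forall t)
\end{aligned} \tag{3.4}
$$

the PSR predicts that observation $o$ will occur at time $t$ with probability

$$
\begin{aligned}
P[o_t = o \mid do(a_t = a), h_t] &= x_\infty^\top P[x^O_{t+1}, o_t = o \mid do(x^A_{t+1}, a_t = a), h_t] \\
&= x_\infty^\top M_{ao}x(h_t)
\end{aligned} \tag{3.5}
$$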

Note that here the normalization vector is marginalizing out the probability of the state from
the joint probability $P[x^O_{t+1}, o_t = o \mid do(x^A_{t+1}, a_t = a), h_t]$. After seeing observation $o$,
Equations 3.3 and 3.5 allow us to implement the conditioning step and update $x(h_t)$ recursively using
Bayes' Rule:

$$
\begin{aligned}
x(h_{t+1}) &= P[x^O_{t+1} \mid do(x^A_{t+1}), h_{t+1}] \\
&= P[x^O_{t+1} \mid do(x^A_{t+1}, a_t = a), o_t = o, h_t] \\
&= \frac{P[x^O_{t+1}, o_t = o \mid do(x^A_{t+1}, a_t = a), h_t]}{P[o_t = o \mid h_t, a_t = a]} \\
&= \frac{M_{ao}x(h_t)}{x_\infty^\top M_{ao}x(h_t)}
\end{aligned} \tag{3.6}
$$

We note that PSR parameters are only determined up to a similarity transform: let $S \in \mathbb{R}^{d \times d}$ be
invertible. Then, if we replace $M_{ao} \to SM_{ao}S^{-1}$, $x(h_1) \to Sx(h_1)$, and $x_\infty \to S^{-\top}x_\infty$, the
resulting PSR makes exactly the same predictions as the original one. (The pairs $S^{-1}S$ cancel in
Equations 3.5 and 3.6.)

Figure 3.1: A simple 3-state 2-observation HMM. An example of a set of indicative events (a
mutually exclusive and exhaustive partition of histories) for this HMM is $\mathcal{H} = \{o_1o_2, o_1o_1, o_2\}$.
(E.g., $o_1o_2$ means that the observation at time $t-1$ is $o_1$, and the one at time $t$ is $o_2$.) An example
of a set of tests for this HMM is $\mathcal{T} = \{o_1o_2, o_1o_1, o_2o_1, o_2o_2\}$. (In this case, $o_1o_2$ means that the
observation at time $t+1$ is $o_1$, and the one at time $t+2$ is $o_2$.)
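
The update in Equation 3.6 and the similarity-transform invariance can be sketched in a few lines. As a simplifying assumption, the example uses a small HMM with a single implicit action, written in the same operator form with the belief state standing in for $x(h_t)$: $M_o = T\,\mathrm{diag}(O_{o,\cdot})$ and $x_\infty = \mathbf{1}$. The matrices `T`, `O`, the transform `S`, and the function `psr_filter` are hypothetical.

```python
import numpy as np

# Hypothetical 3-state, 2-observation HMM (columns of T and O sum to 1).
T = np.array([[0.7, 0.2, 0.1],
              [0.2, 0.6, 0.3],
              [0.1, 0.2, 0.6]])          # T[s_next, s]
O = np.array([[0.9, 0.5, 0.1],
              [0.1, 0.5, 0.9]])          # O[o, s]

M = [T @ np.diag(O[o]) for o in range(2)]   # one operator per observation
x_inf = np.ones(3)                          # normalization vector
x1 = np.full(3, 1.0 / 3.0)                  # initial state

def psr_filter(M, x_inf, x, observations):
    """Return P[o_t | h_t] for each step and the final state."""
    probs = []
    for o in observations:
        joint = M[o] @ x            # Equation 3.3: joint of next state and o
        p_o = x_inf @ joint         # Equation 3.5: marginalize out the state
        probs.append(p_o)
        x = joint / p_o             # Equation 3.6: condition on o
    return np.array(probs), x

obs = [0, 1, 1, 0, 1]
probs, x_final = psr_filter(M, x_inf, x1, obs)

# Apply an arbitrary transform S (assumed invertible) to all parameters.
rng = np.random.default_rng(2)
S = rng.standard_normal((3, 3)) + 3.0 * np.eye(3)
S_inv = np.linalg.inv(S)
M_S = [S @ M_o @ S_inv for M_o in M]
probs_S, x_final_S = psr_filter(M_S, S_inv.T @ x_inf, S @ x1, obs)
```

The transformed model assigns the same probability to every observation, while its internal state is `S` times the original state.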

An important advantage of PSRs is that they are a natural match for fast, statistically
consistent, spectral learning algorithms. In particular, we can relate a PSR's parameters to moments
of directly-observable statistics, and prove bounds on the ranks of matrices of these moments.
These facts suggest a simple algorithm: estimate the moments from a sample, enforce the rank
constraints by projection, and then solve for estimates of the PSR parameters. Based on this
intuition, we give here a spectral algorithm for learning PSRs based on an observable representation
of a PSR, which we now define.


3.3  Observable  Representations 

Specifying a PSR involves first finding a core set of tests $\mathcal{Q}$, called the discovery problem, and
then finding the parameters $M_{ao}$, $x_\infty$, and $x_1$ for these tests, called the learning problem. A core
set $\mathcal{Q}$ for a linear PSR is said to be minimal if the tests in $\mathcal{Q}$ are linearly independent [44, 93],
i.e.,  no  one  test’s  prediction  is  a  linear  function  of  the  other  tests’  predictions.  Historically,  the 
discovery  problem  was  usually  solved  by  searching  for  linearly  independent  tests  by  repeatedly 
performing  Singular  Value  Decompositions  (SVDs)  on  matrices  of  observations  from  collections 
of  tests  [123].  The  learning  problem  was  then  solved  by  regression. 

We introduce a new learning algorithm based on an observable representation of a PSR, that
is, one where each parameter corresponds directly to observable quantities. This representation
depends on a core set of tests $\mathcal{T}$ (not necessarily minimal). It also depends on a set $\mathcal{H}$ of indicative
events, that is, a mutually exclusive and exhaustive partition of the set of all possible histories.
We will assume $\mathcal{H}$ is sufficient (defined below). Both $|\mathcal{T}|$ and $|\mathcal{H}|$ may be arbitrarily larger than
$|\mathcal{Q}|$; this property makes it easier to pick $\mathcal{T}$ and $\mathcal{H}$ satisfying our conditions, since we are free to
choose sets that we believe to be large enough and varied enough to exhibit the types of behavior
that we wish to model. Figure 3.1 illustrates an example of tests and histories in a simple HMM.

For purposes of gathering data, we assume that we can sample from some sufficiently diverse
distribution $\omega$ over histories; our observable representation depends on $\omega$ as well. We will define
"sufficiently diverse" below, but for example, $\omega$ might be the steady-state distribution of some
exploration policy. Note that this assumption means that we cannot estimate $x_1$, since we don't
have multiple samples of trajectories starting from $x_1$. So, instead, we will estimate $x_*$, an
arbitrary feasible state, which is enough information to enable prediction (assuming the process
mixes at a reasonable rate). If we make the stronger assumption that we can repeatedly reset our
PSR to its starting distribution, a straightforward modification of our algorithm will allow us to
estimate $x_1$ as well.

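We define several observable matrices in terms of $\mathcal{T}$, $\mathcal{H}$, and $\omega$. After each definition we show
how these matrices relate to the parameters of the underlying PSR. These relationships will be
key tools in designing our learning algorithm and showing its consistency. The first observable
matrix is $P_{\mathcal{H}} \in \mathbb{R}^{|\mathcal{H}|}$, containing the probabilities of every event $H_i \in \mathcal{H}$ when we sample a
history $h$ according to $\omega$:

$$
\begin{aligned}
[P_{\mathcal{H}}]_i &= P_\omega[h \in H_i] \\
\implies P_{\mathcal{H}} &= P[\mathcal{H}]
\end{aligned} \tag{3.7a}
$$

Here we have defined $P[\mathcal{H}]$ to mean the vector whose elements are $P[H_i]$ for $H_i \in \mathcal{H}$.
Next we define $P_{X,\mathcal{H}}$, the joint probability matrix of states and indicative events:

$$
\begin{aligned}
[P_{X,\mathcal{H}}]_{ij} &= E_\omega\big[x_i(h_t)\, P[H_{j,t} \mid h_t]\big] \\
\implies P_{X,\mathcal{H}} &= E_\omega\big[x(h_t)\, P[\mathcal{H}^\top \mid h_t]\big]
\end{aligned} \tag{3.7b}
$$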

We cannot directly estimate $P_{X,\mathcal{H}}$ from data, but this matrix plays a fundamental role in our
derivations and will appear as a factor in several of the matrices that we define below. One of the
key functions of $P_{X,\mathcal{H}}$ is that it can be simply related to $x_*$ and $P_{\mathcal{H}}$ when $h \sim \omega$. The following
relations will be important for some of our derivations below:

$$
\begin{aligned}
P_{\mathcal{H}}^\top &= E_\omega\big[P[\mathcal{H}^\top \mid h_t]\big] \\
&= E_\omega\big[x_\infty^\top x(h_t)\, P[\mathcal{H}^\top \mid h_t]\big] \\
&= x_\infty^\top E_\omega\big[x(h_t)\, P[\mathcal{H}^\top \mid h_t]\big] \\
&= x_\infty^\top P_{X,\mathcal{H}}
\end{aligned}
$$

and

$$
x_* = P_{X,\mathcal{H}}\mathbf{1}
$$

where $\mathbf{1} \in \mathbb{R}^{|\mathcal{H}|}$ is the vector of 1s that marginalizes out indicative events. This definition works
because it sets $x_*$ to be the mean prediction given any history.

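The next observable matrix is $P_{\mathcal{T},\mathcal{H}} \in \mathbb{R}^{|\mathcal{T}| \times |\mathcal{H}|}$, a covariance matrix whose entries are joint
probabilities of tests $\tau_i \in \mathcal{T}$ and indicative events $H_j \in \mathcal{H}$ when we sample $h_t \sim \omega$ and take
actions $\tau_i^A$:

$$
\begin{aligned}
[P_{\mathcal{T},\mathcal{H}}]_{ij} &= E\big[\tau^O_{i,t} H_{j,t} \mid do(\tau^A_{i,t})\big] \\
&= E\big[P[\tau^O_{i,t}, H_{j,t} \mid h_t, do(\tau^A_{i,t})]\big] \\
&= E\big[P[\tau^O_{i,t} \mid h_t, do(\tau^A_{i,t})]\, P[H_{j,t} \mid h_t]\big] \\
&= E\big[\tau_i(h_t)\, P[H_{j,t} \mid h_t]\big] \\
&= E\big[g_{\tau_i}^\top x(h_t)\, P[H_{j,t} \mid h_t]\big] \\
&= g_{\tau_i}^\top E\big[x(h_t)\, P[H_{j,t} \mid h_t]\big] \\
&= g_{\tau_i}^\top [P_{X,\mathcal{H}}]_j \\
\implies P_{\mathcal{T},\mathcal{H}} &= \Gamma P_{X,\mathcal{H}}
\end{aligned} \tag{3.7c}
$$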

Here the vector $g_{\tau_i}$ lets us compute the probability of test $\tau_i$ given the state, and $\Gamma \in \mathbb{R}^{|\mathcal{T}| \times |\mathcal{Q}|}$ is
the matrix with rows $g_{\tau_i}^\top$. Equation 3.7c tells us that the rank of $P_{\mathcal{T},\mathcal{H}}$ is no more than $|\mathcal{Q}|$, since
its factors $\Gamma$ and $P_{X,\mathcal{H}}$ each have rank at most $|\mathcal{Q}|$. At this point we can define a sufficient set of
indicative events and a sufficiently diverse sampling distribution as promised: they are a set of
indicative events and a sampling distribution for which the rank of $P_{\mathcal{T},\mathcal{H}}$ is equal to $|\mathcal{Q}|$.

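The final observable matrices are $P_{\mathcal{T}^+,ao,\mathcal{H}} \in \mathbb{R}^{|\mathcal{T}| \times |\mathcal{H}|}$, one matrix for each action-observation
pair. Entries of $P_{\mathcal{T}^+,ao,\mathcal{H}}$ are probabilities of triples of an indicative event, the next
action-observation pair, and a subsequent test, if we execute $a$:

$$
\begin{aligned}
[P_{\mathcal{T}^+,ao,\mathcal{H}}]_{ij} &= E\big[\tau^O_{i,t+1}\, \mathbb{1}(o_t = o)\, H_{j,t} \mid do(a_t, \tau^A_{i,t+1})\big] \\
&= E\big[P[\tau^O_{i,t+1}, o_t, H_{j,t} \mid h_t, do(a_t, \tau^A_{i,t+1})]\big] \\
&= E\big[P[\tau^O_{i,t+1}, o_t \mid h_t, do(a_t, \tau^A_{i,t+1})]\, P[H_{j,t} \mid h_t]\big] \\
&= E\big[P[\tau^O_{i,t+1} \mid h_t, o_t, do(a_t, \tau^A_{i,t+1})]\, P[o_t \mid h_t, a_t]\, P[H_{j,t} \mid h_t]\big] \\
&= E\big[\tau_i(h_{t+1})\, P[o_t \mid h_t, a_t]\, P[H_{j,t} \mid h_t]\big] \\
&= E\big[g_{\tau_i}^\top x(h_{t+1})\, P[o_t \mid h_t, a_t]\, P[H_{j,t} \mid h_t]\big] \\
&= E\left[g_{\tau_i}^\top \frac{M_{ao}x(h_t)}{x_\infty^\top M_{ao}x(h_t)}\, P[o_t \mid h_t, a_t]\, P[H_{j,t} \mid h_t]\right] \\
&= g_{\tau_i}^\top E\big[M_{ao}x(h_t)\, P[H_{j,t} \mid h_t]\big] \\
&= g_{\tau_i}^\top M_{ao} E\big[x(h_t)\, P[H_{j,t} \mid h_t]\big] \\
&= g_{\tau_i}^\top M_{ao} [P_{X,\mathcal{H}}]_j \\
\implies P_{\mathcal{T}^+,ao,\mathcal{H}} &= \Gamma M_{ao} P_{X,\mathcal{H}}
\end{aligned} \tag{3.7d}
$$

Here we use Equation 3.6 and the fact that $P[o_t \mid h_t, a_t] = x_\infty^\top M_{ao}x(h_t)$. Just like $P_{\mathcal{T},\mathcal{H}}$, the
matrices $P_{\mathcal{T}^+,ao,\mathcal{H}}$ have rank at most $|\mathcal{Q}|$ due to their factors $\Gamma$ and $P_{X,\mathcal{H}}$.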
Recall from Section 3.2.2 that PSRs are only defined up to a similarity transform. This
means that we can learn any similarity transform of the PSR parameters. To exploit this fact we
assume that we are given an additional matrix $U \in \mathbb{R}^{|\mathcal{T}| \times |\mathcal{Q}|}$ such that $U^\top\Gamma$ is invertible. (Or,
equivalently given our assumptions above, such that $U^\top P_{\mathcal{T},\mathcal{H}}$ has full row rank.) A natural choice
for $U$ is the leading left singular vectors of $P_{\mathcal{T},\mathcal{H}}$, although a randomly-generated $U$ will work
with probability 1 (but will typically result in slower learning). We can think of the columns of
$U$ as specifying a core set $\mathcal{Q}'$ of linear combinations of core tests $\mathcal{Q}$ which define a state vector
for our PSR.

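We now define a PSR in terms of the observable matrices $P_{\mathcal{H}}$, $P_{\mathcal{T},\mathcal{H}}$, $P_{\mathcal{T}^+,ao,\mathcal{H}}$ and $U$, and
simplify the definitions using Equations 3.7(a-c) to show that our parameters are only a similarity
transform away from the original PSR parameters:

$$
\begin{aligned}
b_* &\equiv U^\top P_{\mathcal{T},\mathcal{H}} \mathbf{1} \\
&= U^\top \Gamma P_{X,\mathcal{H}} \mathbf{1} \\
&= (U^\top \Gamma)\, x_*
\end{aligned}
\tag{3.8a}
$$

$$
\begin{aligned}
b_\infty^\top &\equiv P_{\mathcal{H}}^\top (U^\top P_{\mathcal{T},\mathcal{H}})^\dagger \\
&= x_\infty^\top P_{X,\mathcal{H}} (U^\top P_{\mathcal{T},\mathcal{H}})^\dagger \\
&= x_\infty^\top (U^\top \Gamma)^{-1} (U^\top \Gamma) P_{X,\mathcal{H}} (U^\top P_{\mathcal{T},\mathcal{H}})^\dagger \\
&= x_\infty^\top (U^\top \Gamma)^{-1}
\end{aligned}
\tag{3.8b}
$$

$$
\begin{aligned}
B_{ao} &\equiv U^\top P_{\mathcal{T}^+,ao,\mathcal{H}} (U^\top P_{\mathcal{T},\mathcal{H}})^\dagger \\
&= U^\top \Gamma M_{ao} P_{X,\mathcal{H}} (U^\top P_{\mathcal{T},\mathcal{H}})^\dagger \\
&= U^\top \Gamma M_{ao} (U^\top \Gamma)^{-1} (U^\top \Gamma) P_{X,\mathcal{H}} (U^\top P_{\mathcal{T},\mathcal{H}})^\dagger \\
&= (U^\top \Gamma) M_{ao} (U^\top \Gamma)^{-1}
\end{aligned}
\tag{3.8c}
$$

To get $b_1 = (U^\top \Gamma) x_1$ instead of $b_*$ in Equation 3.8a, replace $P_{\mathcal{T},\mathcal{H}} \mathbf{1}$, the vector of expected
probabilities of tests for $h_t \sim \omega$, with the vector of probabilities of tests for $h_t = h_1$.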

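Given the parameters in Equations 3.8a-c, we can calculate the probability of observations
$o_{1:t}$, given that we start from state $x_1$ and intervene with actions $a_{1:t}$. Here we write the product
of a sequence of transition matrices, with the most recent matrix leftmost so that it is applied
last, as $M_{ao_{1:t}} = M_{a_t o_t} \cdots M_{a_2 o_2} M_{a_1 o_1}$.

$$
\begin{aligned}
\mathbb{P}[o_{1:t} \mid \mathrm{do}(a_{1:t})] &= x_\infty^\top M_{ao_{1:t}}\, x_1 \\
&= x_\infty^\top (U^\top \Gamma)^{-1} (U^\top \Gamma) M_{ao_{1:t}} (U^\top \Gamma)^{-1} (U^\top \Gamma)\, x_1 \\
&= b_\infty^\top B_{ao_{1:t}}\, b_1
\end{aligned}
\tag{3.9}
$$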


In addition to the initial PSR state $b_1$, we define normalized conditional internal states $b_t$. We
define the PSR state at time $t + 1$ as:

$$
b_{t+1} \equiv \frac{B_{ao_{1:t}}\, b_1}{b_\infty^\top B_{ao_{1:t}}\, b_1}
\tag{3.10}
$$


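We can define a recursive state update from time $t > 1$ to $t + 1$ as follows (using the above
definition of $b_1$ as the base case):

$$
\begin{aligned}
b_{t+1} &\equiv \frac{B_{ao_{1:t}}\, b_1}{b_\infty^\top B_{ao_{1:t}}\, b_1} \\
&= \frac{B_{a_t o_t} B_{ao_{1:t-1}}\, b_1}{b_\infty^\top B_{a_t o_t} B_{ao_{1:t-1}}\, b_1} \\
&= \frac{B_{a_t o_t}\, b_t}{b_\infty^\top B_{a_t o_t}\, b_t}
\end{aligned}
\tag{3.11}
$$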


If we only have $b_*$ instead of $b_1$, then initial probability estimates will be approximate, but the
difference in predictions will disappear over time as our process mixes. The linear transform
from $b_t$ to a PSR internal state $x(h_t)$ is given by $x(h_t) = (U^\top \Gamma)^{-1} b_t$. If we chose $U$ by SVD,
then the prediction of tests $\tau(h_t)$ at time $t$ is given by $U b_t = U U^\top \Gamma x(h_t) = \Gamma x(h_t)$ (since by the
definition of SVD, $U$ has orthonormal columns and the column spaces of $U$ and $\Gamma$ are the same).
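As a rough illustration of Equations 3.9 to 3.11, the following Python sketch filters with a given set of parameters. The function names and the toy two-state, single-action system are assumptions made for this example only. The toy system uses untransformed parameters (so $B_{ao} = M_{ao}$ and $b_\infty$ is the ones vector), which corresponds to the special case where $U^\top \Gamma$ is the identity.

```python
import numpy as np

def sequence_probability(b1, b_inf, B, pairs):
    """P[o_{1:t} | do(a_{1:t})] as in Equation 3.9.

    `pairs` lists (action, observation) tuples in time order; the operator
    for the most recent pair is applied last (leftmost in the product).
    """
    v = b1
    for ao in pairs:
        v = B[ao] @ v
    return float(b_inf @ v)

def filter_step(b, b_inf, B_ao):
    """Recursive update of Equation 3.11: b_{t+1} = B_ao b_t / (b_inf^T B_ao b_t)."""
    v = B_ao @ b
    return v / float(b_inf @ v)

# Hypothetical toy system (an assumption of this sketch): a 2-state HMM with
# one dummy action 0 and two observations, written in PSR form.
T = np.array([[0.7, 0.2], [0.3, 0.8]])   # column-stochastic transition matrix
O = np.array([[0.9, 0.3], [0.1, 0.7]])   # O[o, s] = P[o | state s]
B = {(0, o): T @ np.diag(O[o]) for o in range(2)}
b1 = np.array([0.5, 0.5])
b_inf = np.ones(2)

# Filter through a short sequence with the recursive update.
seq = [(0, 1), (0, 0), (0, 1)]
b = b1
for ao in seq:
    b = filter_step(b, b_inf, B[ao])

# The same state computed from the batch definition in Equation 3.10.
v = B[seq[2]] @ B[seq[1]] @ B[seq[0]] @ b1
b_batch = v / float(b_inf @ v)
```

The recursive and batch states agree because the normalizing scalars cancel, which is the content of the derivation of Equation 3.11.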


3.4  Learning  Predictive  State  Representations 

Our learning algorithm works by building empirical estimates $\hat{P}_{\mathcal{H}}$, $\hat{P}_{\mathcal{T},\mathcal{H}}$, and $\hat{P}_{\mathcal{T}^+,ao,\mathcal{H}}$ of the
matrices $P_{\mathcal{H}}$, $P_{\mathcal{T},\mathcal{H}}$, and $P_{\mathcal{T}^+,ao,\mathcal{H}}$ defined above. To build these estimates, we repeatedly sample
a history $h_t$ from the distribution $\omega$, execute a sequence of actions from one of our tests, and
record the resulting observations. This data gathering strategy implies that we must be able to
arrange for the system to be in a state corresponding to $h_t \sim \omega$; for example, if our system has a
reset, we can take $\omega$ to be the distribution resulting from executing a fixed exploration policy for
a few steps after reset.

In practice, reset is often not available. In this case we can estimate $\hat{P}_{\mathcal{H}}$, $\hat{P}_{\mathcal{T},\mathcal{H}}$, and $\hat{P}_{\mathcal{T}^+,ao,\mathcal{H}}$
by dividing a single long sequence of action-observation pairs into subsequences and pretending
that each subsequence started with a reset. We are forced to use an initial distribution over
histories, $\omega$, equal to the steady state distribution of the policy which generated the data. This
approach is called the suffix-history algorithm [123]. With this method, the estimated matrices
will be only approximately correct, since interventions that we take at one time will affect the
distribution over histories at future times; however, the approximation is often a good one in
practice, especially if we allow the process to mix by executing several steps of the exploration
policy in between interventions.

If we know or can estimate the exploration policy, we can avoid making any interventions,
and instead use importance weighting [16] to produce samples whose expectations are the true
test outcome probabilities. We summarize this trick here for the special case of an open-loop
randomized exploration policy. Write $\tau_1^A, \tau_2^A, \ldots$ for the action sequences mentioned in our
tests. For simplicity, assume these sequences are distinct (if not, we can merge duplicates). Let
$\zeta_1, \zeta_2, \ldots > 0$ be the probabilities of $\tau_1^A, \tau_2^A, \ldots$ under the exploration policy. Starting at some
time, suppose the exploration policy actually executes the action sequence $\tau_i^A$. Write $\delta$ for the
random variable which is 1 if the corresponding observations match $\tau_i^O$, and 0 if not. Now
construct a sample vector which is zero everywhere except that the $i$th component is $\delta/\zeta_i$. It is
easy to see that the expected value of our sample vector is correct, i.e., that the expectation of
the $i$th component is the true success probability of $\tau_i$; the probability of selection $\zeta_i$ and the
weighting factor $1/\zeta_i$ cancel out.
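The cancellation can be checked directly. The sketch below is a minimal illustration, not part of the original algorithm: the function name and the numerical values for the selection probabilities and test success probabilities are made up for the example. It computes the exact expectation of the importance-weighted sample vector by enumerating which action sequence is executed and whether the test succeeds.

```python
import numpy as np

def weighted_sample(i, delta, zeta):
    """Sample vector: zero everywhere except component i, which is delta / zeta[i]."""
    v = np.zeros(len(zeta))
    v[i] = delta / zeta[i]
    return v

# Hypothetical values, chosen only for illustration.
zeta = np.array([0.5, 0.3, 0.2])      # P[policy executes action sequence i]
p_true = np.array([0.9, 0.4, 0.25])   # true success probability of test i

# Exact expectation over which sequence is executed (i) and whether the
# observations match (delta = 1) or not (delta = 0).
expected = np.zeros_like(p_true)
for i in range(len(zeta)):
    expected += zeta[i] * p_true[i] * weighted_sample(i, 1.0, zeta)
    expected += zeta[i] * (1.0 - p_true[i]) * weighted_sample(i, 0.0, zeta)
```

Here `expected` equals `p_true` up to rounding. Rarely selected sequences receive large weights, which is the source of the extra variance of this estimator.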

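Once we have computed $\hat{P}_{\mathcal{H}}$, $\hat{P}_{\mathcal{T},\mathcal{H}}$, and $\hat{P}_{\mathcal{T}^+,ao,\mathcal{H}}$, we can generate $\hat{U}$ by singular value
decomposition of $\hat{P}_{\mathcal{T},\mathcal{H}}$. We can then learn the PSR parameters by plugging $\hat{U}$, $\hat{P}_{\mathcal{H}}$, $\hat{P}_{\mathcal{T},\mathcal{H}}$, and
$\hat{P}_{\mathcal{T}^+,ao,\mathcal{H}}$ into Equation 3.8(a-c). For reference, we summarize the above steps here:

1. Compute empirical estimates $\hat{P}_{\mathcal{H}}$, $\hat{P}_{\mathcal{T},\mathcal{H}}$, $\hat{P}_{\mathcal{T}^+,ao,\mathcal{H}}$.

2. Use SVD on $\hat{P}_{\mathcal{T},\mathcal{H}}$ to compute $\hat{U}$, the matrix of left singular vectors corresponding to the
$n$ largest singular values.

3. Compute model parameter estimates:

(a) $\hat{b}_* = \hat{U}^\top \hat{P}_{\mathcal{T},\mathcal{H}} \mathbf{1}$,

(b) $\hat{b}_\infty = (\hat{P}_{\mathcal{T},\mathcal{H}}^\top \hat{U})^\dagger \hat{P}_{\mathcal{H}}$,

(c) $\hat{B}_{ao} = \hat{U}^\top \hat{P}_{\mathcal{T}^+,ao,\mathcal{H}} (\hat{U}^\top \hat{P}_{\mathcal{T},\mathcal{H}})^\dagger$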
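The steps above can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions, not a reference implementation: the function name `learn_psr`, the toy hidden Markov model, and the choice of tests (the next single observation) and indicative events (the value of the first observation) are all hypothetical. To keep the example deterministic, exact matrices are used in place of the empirical estimates of step 1.

```python
import numpy as np

def learn_psr(P_H, P_TH, P_TaoH, n):
    """Steps 2 and 3 of the summary above.

    P_H:    vector of indicative-event probabilities, shape (|H|,)
    P_TH:   test / indicative-event matrix, shape (|T|, |H|)
    P_TaoH: dict mapping each (action, observation) pair to a (|T|, |H|) matrix
    n:      number of singular vectors to keep
    """
    U = np.linalg.svd(P_TH, full_matrices=False)[0][:, :n]
    UP_pinv = np.linalg.pinv(U.T @ P_TH)
    b_star = U.T @ P_TH @ np.ones(P_TH.shape[1])
    b_inf = np.linalg.pinv(P_TH.T @ U) @ P_H
    B = {ao: U.T @ P @ UP_pinv for ao, P in P_TaoH.items()}
    return b_star, b_inf, B

# Hypothetical 2-state, 3-observation HMM with a single dummy action 0.
T = np.array([[0.7, 0.2], [0.3, 0.8]])               # column-stochastic
O = np.array([[0.6, 0.1], [0.3, 0.2], [0.1, 0.7]])   # O[o, s] = P[o | s]
pi = np.array([0.5, 0.5])
M = {(0, o): T @ np.diag(O[o]) for o in range(3)}

# Exact matrices. Indicative event j: the first observation is j.
# Test i: the next observation is i.
P_XH = np.column_stack([M[(0, j)] @ pi for j in range(3)])
P_H = O @ pi
P_TH = O @ P_XH
P_TaoH = {ao: O @ M_ao @ P_XH for ao, M_ao in M.items()}

b_star, b_inf, B = learn_psr(P_H, P_TH, P_TaoH, n=2)

def learned_prob(obs):
    """Sequence probability from the learned parameters (Equation 3.9 with b_*)."""
    v = b_star
    for o in obs:
        v = B[(0, o)] @ v
    return float(b_inf @ v)

def hmm_prob(obs):
    """The same probability computed directly from the HMM."""
    v = T @ pi              # expected state after the (marginalized) first step
    for o in obs:
        v = M[(0, o)] @ v
    return float(v.sum())
```

With exact input matrices, `learned_prob` and `hmm_prob` agree on every observation sequence, even though the learned parameters differ from the HMM parameters by the similarity transform $U^\top \Gamma$. With empirical estimates the agreement is only approximate.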

As we include more data in our averages, the law of large numbers guarantees that our estimates
$\hat{P}_{\mathcal{H}}$, $\hat{P}_{\mathcal{T},\mathcal{H}}$, and $\hat{P}_{\mathcal{T}^+,ao,\mathcal{H}}$ converge to the true matrices $P_{\mathcal{H}}$, $P_{\mathcal{T},\mathcal{H}}$, and $P_{\mathcal{T}^+,ao,\mathcal{H}}$ (defined in
Equation 3.7). So by continuity of the formulas above, if our system is truly a PSR of finite rank, our
estimates $\hat{b}_*$, $\hat{b}_\infty$, and $\hat{B}_{ao}$ converge to the true parameters up to a linear transform; that is, our
learning algorithm is consistent.¹ Although parameters estimated with finite data can sometimes
lead to negative probability estimates when filtering or predicting, this problem can be avoided
in practice by thresholding the predicted probabilities by some small positive probability.

The learning strategy employed here may be seen as a generalization of Hsu et al.’s spectral
algorithm for learning HMMs [42] to PSRs. Since HMMs and POMDPs are a proper subset
of PSRs, we can use the algorithm in this chapter to learn back both HMMs and POMDPs in
PSR form. However, the problem of learning HMMs and POMDPs in general is hard under
cryptographic assumptions [52, 109]. Therefore, some models will require a large amount of
data (and thus a large amount of computation) to learn exactly. The learning algorithm presented
here embodies a tradeoff: it relinquishes the ability to learn very difficult HMMs with little data
and a lot of computation, but is a very effective learning algorithm for easier HMMs.

Since  uncontrolled  PSRs  are  equivalent  to  OOMs,  the  learning  algorithm  presented  here  can 
also  be  used  to  efficiently  learn  OOMs.  In  fact,  the  naive  OOM  learning  algorithm  [44]  is  similar 
to  our  PSR  learning  algorithm,  but  uses  a  fixed  subspace  rather  than  employing  SVD  to  choose 
a  subspace.  The  more  sophisticated  efficiency  sharpening  algorithm  [38]  is  an  iterative  way  to 
choose  a  subspace  for  OOMs  that  results  in  more  statistically  efficient  estimates  than  the  naive 
algorithm. 

Finally, note that the learning algorithm presented here is distinct from the TPSR learning
algorithm of [84]. In addition to the differences mentioned above, a key difference between the
two algorithms is that here we estimate the joint probability of a past event, a current observation,
and a future event in the matrix $P_{\mathcal{T}^+,ao,\mathcal{H}}$, whereas [84] instead estimate the probability of a
future event, conditioned on a past event and a current observation. To compensate, Rosencrantz
et al. later multiply this estimate by an approximation of the probability of the current observation,
conditioned on the past event. Rosencrantz et al. also derive the approximate probability
of the current observation differently: as the result of a regression instead of directly from
empirical counts. Finally, Rosencrantz et al. do not make any attempt to multiply by the marginal
probability of the past event, although this term cancels in the current work. In the absence of
estimation errors, both algorithms would arrive at the same answer, but taking errors into account,
they will typically make different predictions. The difficulty of extending the Rosencrantz et al.
algorithm to handle real-valued features stems from the difference between joint and conditional
probabilities: the observable matrices in Rosencrantz et al. are conditional expectations, so their
algorithm depends on being able to condition on discrete indicative events or observations.

¹ The pseudoinverses are continuous at the true parameters, since the matrices to be pseudoinverted have full
rank. The matrix of $n$ left singular vectors $\hat{U}$ may not be a continuous function of $\hat{P}_{\mathcal{T},\mathcal{H}}$ (in case of repeated singular
values); to deal with this possibility, we can either fix $\hat{U}$ (say, as the left singular vectors of our estimated $\hat{P}_{\mathcal{T},\mathcal{H}}$ after
some fixed amount of data), or we can make a slightly more complex argument based on the fact that the column
span of $\hat{U}$ is a continuous function of $\hat{P}_{\mathcal{T},\mathcal{H}}$ near $P_{\mathcal{T},\mathcal{H}}$ (since the $n$th singular value of $P_{\mathcal{T},\mathcal{H}}$ is nonzero, and is
therefore separated from the $(n + 1)$st, which is zero).

Chapter 4

Learning  Predictive  State  Representations 
with  Features 


In this chapter we generalize our observable representation of PSRs in Chapter 3, which was
built from joint probabilities of discrete observations, to expectations of continuous features.
In data gathered from complex real-world dynamical systems, it may not be possible to find a
reasonably-sized core set of discrete tests $\mathcal{T}$ or sufficient set of indicative events $\mathcal{H}$. When this is
the case, we can generalize the PSR learning algorithm and work with features of test outcomes
and histories, which we call characteristic features and indicative features respectively. Similarly,
if we have a large number of discrete observations and actions, the number of parameters in
the PSR can become large enough to make learning practically impossible. When this is the case,
we can generalize the PSR to recursively update state based on observation features through a
matrix representation of Bayes’ rule.

We will discuss features in two steps. First we will define PSRs in terms of moments of
observable features of tests and histories. This extension has important practical consequences: we
often have domain knowledge that allows us to choose features that “make sense” for the system
we are trying to model, and choosing a small set of features usually means that the learning
algorithm converges at a faster rate in practice. Second, we generalize the PSR Bayes’ rule update
to features of observations. In this chapter we will focus on learning PSRs that contain a large but
finite number of discrete actions, observations, and features. We will generalize to continuous
actions and observations, and infinite feature spaces, in Chapter 5.


4.1  Characteristic  and  Indicative  Features 

4.1.1  Characteristic  Features 

We can generalize the notion of a test from Section 3.2.1 to a characteristic feature, a feature of
the future that is a linear combination of several tests sharing a common action sequence. These
features are called characteristic because the expectation of the features fully characterizes the
distribution of the future. As an example of a characteristic feature: if $\tau_1$ and $\tau_2$ are two tests with
$\tau_1^A = \tau_2^A \equiv \tau^A$, then we can make a feature $\phi = 3\tau_1 + \tau_2$. This feature is executed if we
intervene to $\mathrm{do}(\tau^A)$, and, if it is executed, its value is $3\,\mathbb{I}(\tau_1^O) + \mathbb{I}(\tau_2^O)$, where $\mathbb{I}(o_1 \ldots o_k)$ stands
for an indicator random variable, taking the value 0 or 1 depending on whether we observe the
sequence of observations $o_1 \ldots o_k$. The prediction of $\phi$ given $h_t$ is
$\phi(h_t) \equiv \mathbb{E}[\phi \mid h_t, \mathrm{do}(\tau^A)] = 3\tau_1(h_t) + \tau_2(h_t)$.


While linear combinations of tests may seem restrictive, our definition is actually very
expressive: we can represent an arbitrary function of a finite sequence of future observations. We
could also allow convergent limits of features (i.e., take the closure of the set of features),
meaning that we could represent many functions of the entire infinite sequence of future observations.
Another view of this is that we could approximate a feature of the infinite sequence arbitrarily
closely. To build a feature, we take a collection of tests, each of which picks out one possible
realization of the sequence, and weight each test by the value of the function conditioned
on that realization. For example, if our observations are integers $1, 2, \ldots, 10$, we can write the
square of the next observation as $\sum_{o=1}^{10} o^2\, \mathbb{I}(o)$ and the mean of the next two observations as
$\sum_{o=1}^{10} \sum_{o'=1}^{10} \frac{1}{2}(o + o')\, \mathbb{I}(o, o')$.
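The two example features can be written out explicitly. In the sketch below, a feature is stored as a dictionary from observation sequences (tests) to weights; this representation, the function names, and the uniform test probabilities are assumptions made for illustration.

```python
from itertools import product

OBS = range(1, 11)   # observations are the integers 1..10

# A feature is stored as {observation sequence: weight}; each key is one test
# (one possible realization) and the weight is the function value on it.
square_next = {(o,): float(o * o) for o in OBS}
mean_next_two = {(o, o2): 0.5 * (o + o2) for o, o2 in product(OBS, repeat=2)}

def feature_value(feature, realized):
    """Value of sum_tau w_tau * I(tau): only the realized test's indicator is 1."""
    return feature.get(tuple(realized), 0.0)

def feature_prediction(feature, test_prob):
    """Prediction sum_tau w_tau * tau(h_t), given each test's success probability."""
    return sum(w * test_prob(tau) for tau, w in feature.items())

# Hypothetical test probabilities: observations i.i.d. uniform on 1..10.
def uniform(tau):
    return 0.1 ** len(tau)

value_sq = feature_value(square_next, (7,))           # realized observation 7
value_mean = feature_value(mean_next_two, (3, 8))     # realized observations 3, 8
pred_sq = feature_prediction(square_next, uniform)    # E[o^2] under the uniform model
```

Because exactly one test in each collection succeeds on any realized sequence, the weighted sum of indicators reproduces the function value, and linearity of expectation gives the prediction as the same weighted sum of test probabilities.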

The restriction to a common action sequence is necessary: without this restriction, all the
tests making up a feature could never be executed at once. Once we move to feature predictions,
however, it makes sense to lift this restriction: we will say that any linear combination of
feature predictions is also a feature prediction, even if the features involved have different action
sequences.

Action sequences raise some problems with obtaining empirical estimates of means and
covariances of features of the future: e.g., it is not always possible to get a sample of a particular
feature’s value on every time step, and the feature we choose to sample at one step can restrict
which features we can sample at subsequent steps. In order to carry out our derivations without
running into these problems repeatedly, we will assume for the rest of the chapter that we can
reset our system after every sample, and get a new history independently distributed as $h_t \sim \omega$ for
some distribution $\omega$. (With some additional bookkeeping we could remove this assumption [16],
but this bookkeeping would unnecessarily complicate our derivations.)

Furthermore, we will introduce some new language, again to keep derivations simple: if we
have a vector of features of the future $\phi^{\mathcal{T}}$, we will pretend that we can get a sample $\phi^{\mathcal{T}}_t$ in which
we evaluate all of our features starting from a single history $h_t$, even if the different elements of
$\phi^{\mathcal{T}}$ require us to execute different action sequences. When our algorithms call for such a sample,
we will instead use the following trick to get a random vector with the correct expectation (and
somewhat higher variance, which doesn’t matter for any of our arguments): write $\tau_1^A, \tau_2^A, \ldots$
for the different action sequences, and let $\zeta_1, \zeta_2, \ldots > 0$ be a probability distribution over these
sequences. We pick a single action sequence $\tau_a^A$ according to $\zeta$, and execute $\tau_a^A$ to get a sample
$\hat{\phi}^{\mathcal{T}}$ of the features which depend on $\tau_a^A$. We then enter $\hat{\phi}^{\mathcal{T}}/\zeta_a$ into the corresponding coordinates
of $\phi^{\mathcal{T}}_t$, and fill in zeros everywhere else. It is easy to see that the expected value of our sample
vector is then correct: the probability of selection $\zeta_a$ and the weighting factor $1/\zeta_a$ cancel out.
We will write $\mathbb{E}[\phi^{\mathcal{T}}_t \mid h_t, \mathrm{do}(\zeta)]$ to stand for this expectation.

4.1.2 Indicative Features


Just as we generalized the notion of test to characteristic features, we can generalize the notion
of a history to indicative features. Such features are called indicative because they indicate what
has already happened. An indicative feature is a function of the history $h_t$. We have already seen
a specific type of indicative feature in Section 3.3, an indicative event, which is one of a set of
mutually exclusive and exhaustive partitions of history. Other indicative features might reference
the number of times we saw $o_1$ in the past three steps; or, in a domain with actions, indicative
features can reference past actions: e.g., the number of times we did action $a_1$ in the past three
steps.
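A small sketch of such features follows. The representation of a history as a list of (action, observation) pairs, the function names, and the inclusion of a constant feature are assumptions made for this example.

```python
import numpy as np

def indicative_features(history, o_target, a_target, window=3):
    """Map a history, a list of (action, observation) pairs, to a feature vector.

    Features: a constant, the number of times `o_target` was observed in the
    last `window` steps, and the number of times `a_target` was executed.
    """
    recent = history[-window:]
    n_obs = sum(1 for _, o in recent if o == o_target)
    n_act = sum(1 for a, _ in recent if a == a_target)
    return np.array([1.0, n_obs, n_act], dtype=float)

def indicative_event(history, n_obs):
    """One-hot encoding of a partition of histories by the last observation."""
    e = np.zeros(n_obs)
    e[history[-1][1]] = 1.0
    return e

# Hypothetical history with actions in {0, 1} and observations in {0, 1, 2}.
history = [(0, 2), (1, 1), (0, 1), (1, 0)]
counts = indicative_features(history, o_target=1, a_target=1)
event = indicative_event(history, n_obs=3)
```

The one-hot vector returned by `indicative_event` is the special case of an indicative event: exactly one coordinate is 1 for every history, so the events are mutually exclusive and exhaustive.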


4.2  Defining  PSRs  with  Characteristic  and  Indicative  Features 

Let $Q$ be a minimal core set of tests for a dynamical system, with cardinality $d = |Q|$ equal
to the linear dimension of the system. Then, let $\mathcal{T}$ be a larger core set of tests (not necessarily
minimal, and possibly even with $|\mathcal{T}|$ countably infinite). And, let $\mathcal{H}$ be the set of all possible
histories. ($|\mathcal{H}|$ is finite or countably infinite, depending on whether our system is finite-horizon
or infinite-horizon.)

We write $\phi^{\mathcal{H}}_t$ for a vector of indicative features of history at time $t$, and write $\phi^{\mathcal{T}}_t$
for a vector of characteristic features of the future at time $t$. Since $\mathcal{T}$ is a core set of tests, by
definition we can compute any test prediction $\tau(h_t)$ as a linear function of the tests in $\mathcal{T}$. And,
since feature predictions are linear combinations of test predictions, we can also compute any
feature prediction $\mathbb{E}[\phi^{\mathcal{T}}_t \mid \mathrm{do}(\zeta), h_t]$ as a linear function of the tests in $\mathcal{T}$. We define the matrix
$\Phi^{\mathcal{T}}$ to embody our predictions of future features: that is, an entry of $\Phi^{\mathcal{T}}$ is the weight
of one of the tests in $\mathcal{T}$ for calculating the prediction of one of the features in $\phi^{\mathcal{T}}$.

Below we define several matrices, $\mu_{\mathcal{H}}$, $\Sigma_{\mathcal{T},\mathcal{H}}$, and $\Sigma_{\mathcal{T}^+,ao,\mathcal{H}}$, in terms of characteristic and
indicative features $\phi^{\mathcal{T}}_t$ and $\phi^{\mathcal{H}}_t$ and discrete actions and observations $a_t$ and $o_t$, and show how
these matrices relate to the parameters of the underlying PSR.

These matrices are the analog of the matrices $P_{\mathcal{H}}$, $P_{\mathcal{T},\mathcal{H}}$, and $P_{\mathcal{T}^+,ao,\mathcal{H}}$ in Equations 3.7,
but will no longer contain probabilities, but rather expected values of features or products of
features. For the special case of features that are indicator functions of test outcomes and sets of
histories, we recover the probability matrices from Section 3.3. And, just as in Section 3.3, we
can estimate expected values from repeated trials, or from one long sequence; and, we can use
importance sampling to avoid having to make any interventions.

We need to require a version of sufficiency for our sets of features, as we did for tests and
indicative events above. The updated definition of sufficiency is analogous to our earlier definitions
of core tests and sufficient indicative events: we require that the rank of $\Sigma_{\mathcal{T},\mathcal{H}}$ (defined below in
Equation 4.1c) is equal to the linear dimension of the system. The advantage of working with
features instead of events is that, in practice, it seems to be easier to ensure sufficiency with a
moderate number of features.

We begin by defining the analog of $P_{\mathcal{H}}$ from Equation 3.7a:

$$
\mu_{\mathcal{H}} \equiv \mathbb{E}\big[\phi^{\mathcal{H}}_t\big]
\;\Longrightarrow\; \mu_{\mathcal{H}}^\top = \mathbb{E}\big[\phi^{\mathcal{H}\top}_t\big]
\tag{4.1a}
$$

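Next we define $\Sigma_{X,\mathcal{H}}$, the cross covariance of states and features of histories. Let

$$
\begin{aligned}
\Sigma_{X,\mathcal{H}} &\equiv \mathbb{E}\big[x(h_t)\,\phi^{\mathcal{H}\top}_t\big] \\
&= \mathbb{E}\big[x(h_t)\,\mathbb{E}[\phi^{\mathcal{H}\top}_t \mid h_t]\big]
\end{aligned}
\tag{4.1b}
$$

Like $P_{X,\mathcal{H}}$ in Equation 3.7b, we cannot directly estimate $\Sigma_{X,\mathcal{H}}$ from data, but this matrix will
appear as a factor in several of the matrices that we define below. We can see that $\Sigma_{X,\mathcal{H}}$ is related
to $\mu_{\mathcal{H}}$:

$$
\begin{aligned}
x_\infty^\top \Sigma_{X,\mathcal{H}} &= x_\infty^\top\, \mathbb{E}\big[x(h_t)\,\mathbb{E}[\phi^{\mathcal{H}\top}_t \mid h_t]\big] \\
&= \mathbb{E}\big[x_\infty^\top x(h_t)\,\mathbb{E}[\phi^{\mathcal{H}\top}_t \mid h_t]\big] \\
&= \mathbb{E}\big[\mathbb{E}[\phi^{\mathcal{H}\top}_t \mid h_t]\big] \\
&= \mathbb{E}\big[\phi^{\mathcal{H}\top}_t\big] \\
&= \mu_{\mathcal{H}}^\top
\end{aligned}
$$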

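Next we define $\Sigma_{\mathcal{T},\mathcal{H}}$, the cross covariance matrix of the features of tests and histories:

$$
\begin{aligned}
\Sigma_{\mathcal{T},\mathcal{H}} &\equiv \mathbb{E}\big[\phi^{\mathcal{T}}_t \phi^{\mathcal{H}\top}_t \mid \mathrm{do}(\zeta)\big] \\
&= \mathbb{E}\big[\mathbb{E}[\phi^{\mathcal{T}}_t \phi^{\mathcal{H}\top}_t \mid \mathrm{do}(\zeta), h_t]\big] \\
&= \mathbb{E}\big[\mathbb{E}[\phi^{\mathcal{T}}_t \mid \mathrm{do}(\zeta), h_t]\,\mathbb{E}[\phi^{\mathcal{H}\top}_t \mid h_t]\big] \\
&= \mathbb{E}\Big[\sum_{i=1}^{|\mathcal{T}|} \Phi^{\mathcal{T}}_{:,i}\,\tau_i(h_t)\,\mathbb{E}[\phi^{\mathcal{H}\top}_t \mid h_t]\Big] \\
&= \sum_{i=1}^{|\mathcal{T}|} \Phi^{\mathcal{T}}_{:,i}\,\mathbb{E}\big[\tau_i(h_t)\,\mathbb{E}[\phi^{\mathcal{H}\top}_t \mid h_t]\big] \\
&= \sum_{i=1}^{|\mathcal{T}|} \Phi^{\mathcal{T}}_{:,i}\,\mathbb{E}\big[g_{\tau_i}^\top x(h_t)\,\mathbb{E}[\phi^{\mathcal{H}\top}_t \mid h_t]\big] \\
&= \sum_{i=1}^{|\mathcal{T}|} \Phi^{\mathcal{T}}_{:,i}\,g_{\tau_i}^\top\,\mathbb{E}\big[x(h_t)\,\mathbb{E}[\phi^{\mathcal{H}\top}_t \mid h_t]\big] \\
&= \Phi^{\mathcal{T}} \Gamma\, \mathbb{E}\big[x(h_t)\,\mathbb{E}[\phi^{\mathcal{H}\top}_t \mid h_t]\big] \\
\Longrightarrow\; \Sigma_{\mathcal{T},\mathcal{H}} &= \Phi^{\mathcal{T}} \Gamma\, \Sigma_{X,\mathcal{H}}
\end{aligned}
\tag{4.1c}
$$

As in Chapter 3, the vector $g_{\tau_i}$ is the linear function that specifies the probability of the test
$\tau_i$ given the probabilities of tests in the core set $Q$, and the matrix $\Gamma$ has all of the $g_{\tau_i}$ vectors
as rows. The state vector $x(h_t)$ contains the probabilities of all tests in $Q$ given history $h_t$.
The above derivation shows that, because of our assumptions about the linear dimension of the
system, the matrix $\Sigma_{\mathcal{T},\mathcal{H}}$ has factors $\Phi^{\mathcal{T}} \Gamma$ (with $d$ columns) and $\Sigma_{X,\mathcal{H}}$ (with $d$ rows). Therefore, the rank of
$\Sigma_{\mathcal{T},\mathcal{H}}$ is no more than $d$, the linear dimension of the system.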

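Finally, we define $\Sigma_{\mathcal{T}^+,ao,\mathcal{H}}$, a set of matrices, one for each action-observation pair, that
represent the covariance between features of history before and after taking action $a$ and observing
$o$. In the following, $\mathbb{I}(o_t = o)$ is an indicator variable for whether we see observation $o$ at step $t$.

$$
\begin{aligned}
\Sigma_{\mathcal{T}^+,ao,\mathcal{H}} &\equiv \mathbb{E}\big[\phi^{\mathcal{T}}_{t+1}\,\mathbb{I}(o_t = o)\,\phi^{\mathcal{H}\top}_t \mid \mathrm{do}(a_t, \zeta^+)\big] \\
&= \mathbb{E}\big[\mathbb{E}[\phi^{\mathcal{T}}_{t+1}\,\mathbb{I}(o_t = o)\,\phi^{\mathcal{H}\top}_t \mid \mathrm{do}(a_t, \zeta^+), h_t]\big] \\
&= \mathbb{E}\big[\mathbb{E}[\phi^{\mathcal{T}}_{t+1}\,\mathbb{I}(o_t = o) \mid \mathrm{do}(a_t, \zeta^+), h_t]\,\mathbb{E}[\phi^{\mathcal{H}\top}_t \mid h_t]\big] \\
&= \mathbb{E}\big[\mathbb{E}[\phi^{\mathcal{T}}_{t+1} \mid \mathrm{do}(a_t, \zeta^+), h_t, o_t]\,\mathbb{P}[o_t \mid h_t, \mathrm{do}(a_t)]\,\mathbb{E}[\phi^{\mathcal{H}\top}_t \mid h_t]\big] \\
&= \mathbb{E}\Big[\sum_{i=1}^{|\mathcal{T}|} \Phi^{\mathcal{T}}_{:,i}\,\tau_i(h_{t+1})\,\mathbb{P}[o_t \mid h_t, \mathrm{do}(a_t)]\,\mathbb{E}[\phi^{\mathcal{H}\top}_t \mid h_t]\Big] \\
&= \sum_{i=1}^{|\mathcal{T}|} \Phi^{\mathcal{T}}_{:,i}\,\mathbb{E}\big[\tau_i(h_{t+1})\,\mathbb{P}[o_t \mid h_t, \mathrm{do}(a_t)]\,\mathbb{E}[\phi^{\mathcal{H}\top}_t \mid h_t]\big] \\
&= \sum_{i=1}^{|\mathcal{T}|} \Phi^{\mathcal{T}}_{:,i}\,\mathbb{E}\big[g_{\tau_i}^\top x(h_{t+1})\,\mathbb{P}[o_t \mid h_t, \mathrm{do}(a_t)]\,\mathbb{E}[\phi^{\mathcal{H}\top}_t \mid h_t]\big] \\
&= \sum_{i=1}^{|\mathcal{T}|} \Phi^{\mathcal{T}}_{:,i}\,g_{\tau_i}^\top\,\mathbb{E}\big[x(h_{t+1})\,\mathbb{P}[o_t \mid h_t, \mathrm{do}(a_t)]\,\mathbb{E}[\phi^{\mathcal{H}\top}_t \mid h_t]\big] \\
&= \Phi^{\mathcal{T}} \Gamma\, \mathbb{E}\big[x(h_{t+1})\,\mathbb{P}[o_t \mid h_t, \mathrm{do}(a_t)]\,\mathbb{E}[\phi^{\mathcal{H}\top}_t \mid h_t]\big] \\
&= \Phi^{\mathcal{T}} \Gamma\, \mathbb{E}\left[\frac{M_{ao}\,x(h_t)}{x_\infty^\top M_{ao}\,x(h_t)}\,\mathbb{P}[o_t \mid h_t, \mathrm{do}(a_t)]\,\mathbb{E}[\phi^{\mathcal{H}\top}_t \mid h_t]\right] \\
&= \Phi^{\mathcal{T}} \Gamma\, \mathbb{E}\big[M_{ao}\,x(h_t)\,\mathbb{E}[\phi^{\mathcal{H}\top}_t \mid h_t]\big] \\
&= \Phi^{\mathcal{T}} \Gamma M_{ao}\, \mathbb{E}\big[x(h_t)\,\mathbb{E}[\phi^{\mathcal{H}\top}_t \mid h_t]\big] \\
\Longrightarrow\; \Sigma_{\mathcal{T}^+,ao,\mathcal{H}} &= \Phi^{\mathcal{T}} \Gamma M_{ao}\, \Sigma_{X,\mathcal{H}}
\end{aligned}
\tag{4.1d}
$$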


Our goal is now to define a PSR in terms of the above moments of characteristic and indicative
features. Similar to our approach in Chapter 3, we will need a matrix $U$ such that $U^\top \Phi^{\mathcal{T}} \Gamma$ is
invertible (i.e. $U^\top \Sigma_{\mathcal{T},\mathcal{H}}$ has full row rank); we can take $U$ to be the left singular vectors of $\Sigma_{\mathcal{T},\mathcal{H}}$.
We also assume that we have a vector $e$ s.t. $\phi^{\mathcal{H}\top}_t e = 1$ ($\forall t$). The existence of $e$ means that the
ones vector $\mathbf{1}^\top$ must be in the row space of $\Phi^{\mathcal{H}}$. Since $\Phi^{\mathcal{H}}$ is a matrix of features, we can always
ensure that this is the case by requiring one of our features to be a constant.

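The parameters of the PSR ($b_*$, $b_\infty$, and $B_{ao}$) are now defined as follows in terms of the
matrices $\mu_{\mathcal{H}}$, $\Sigma_{\mathcal{T},\mathcal{H}}$, $\Sigma_{\mathcal{T}^+,ao,\mathcal{H}}$, and $U$. After each definition we simplify the expressions using
Equations 4.1a-d to show that our parameters are only a similarity transform away from the
original PSR parameters:

$$
\begin{aligned}
b_* &\equiv U^\top \Sigma_{\mathcal{T},\mathcal{H}}\, e \\
&= U^\top \Phi^{\mathcal{T}} \Gamma\, \Sigma_{X,\mathcal{H}}\, e \\
&= (U^\top \Phi^{\mathcal{T}} \Gamma)\, x_*
\end{aligned}
\tag{4.2a}
$$

$$
\begin{aligned}
b_\infty^\top &\equiv \mu_{\mathcal{H}}^\top (U^\top \Sigma_{\mathcal{T},\mathcal{H}})^\dagger \\
&= x_\infty^\top \Sigma_{X,\mathcal{H}} (U^\top \Sigma_{\mathcal{T},\mathcal{H}})^\dagger \\
&= x_\infty^\top (U^\top \Phi^{\mathcal{T}} \Gamma)^{-1} (U^\top \Phi^{\mathcal{T}} \Gamma)\, \Sigma_{X,\mathcal{H}} (U^\top \Sigma_{\mathcal{T},\mathcal{H}})^\dagger \\
&= x_\infty^\top (U^\top \Phi^{\mathcal{T}} \Gamma)^{-1}
\end{aligned}
\tag{4.2b}
$$

$$
\begin{aligned}
B_{ao} &\equiv U^\top \Sigma_{\mathcal{T}^+,ao,\mathcal{H}} (U^\top \Sigma_{\mathcal{T},\mathcal{H}})^\dagger \\
&= U^\top \Phi^{\mathcal{T}} \Gamma M_{ao}\, \Sigma_{X,\mathcal{H}} (U^\top \Sigma_{\mathcal{T},\mathcal{H}})^\dagger \\
&= U^\top \Phi^{\mathcal{T}} \Gamma M_{ao} (U^\top \Phi^{\mathcal{T}} \Gamma)^{-1} (U^\top \Phi^{\mathcal{T}} \Gamma)\, \Sigma_{X,\mathcal{H}} (U^\top \Sigma_{\mathcal{T},\mathcal{H}})^\dagger \\
&= (U^\top \Phi^{\mathcal{T}} \Gamma) M_{ao} (U^\top \Phi^{\mathcal{T}} \Gamma)^{-1}
\end{aligned}
\tag{4.2c}
$$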

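Just as in Section 3.4 where we estimate $\hat{U}$, $\hat{P}_{\mathcal{H}}$, $\hat{P}_{\mathcal{T},\mathcal{H}}$, and $\hat{P}_{\mathcal{T}^+,ao,\mathcal{H}}$, we can estimate $\hat{U}$, $\hat{\mu}_{\mathcal{H}}$,
$\hat{\Sigma}_{\mathcal{T},\mathcal{H}}$, and $\hat{\Sigma}_{\mathcal{T}^+,ao,\mathcal{H}}$ from data, and then plug the estimates into Equations 4.2(a-c). And, just
as before, our estimation equations are continuous near the true values of $U$, $\mu_{\mathcal{H}}$, $\Sigma_{\mathcal{T},\mathcal{H}}$, and
$\Sigma_{\mathcal{T}^+,ao,\mathcal{H}}$.¹ Thus we see that if we work with characteristic and indicative features, and if our
system is truly a PSR of finite rank, our estimates $\hat{b}_*$, $\hat{b}_\infty$, and $\hat{B}_{ao}$ again converge to the true PSR
parameters up to a linear transform.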
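As a hedged illustration of the estimation step, the moment matrices can be approximated by sample averages of outer products of feature vectors. The sketch below assumes i.i.d. samples obtained with a reset, assumes any importance weights have already been folded into the feature vectors, and uses a made-up function name and array layout.

```python
import numpy as np

def empirical_moments(phi_H, phi_T, phi_T_next, ao, n_pairs):
    """Sample averages standing in for the expectations in Equations 4.1a-d.

    phi_H:      (N, l) indicative features of each sampled history
    phi_T:      (N, k) characteristic features of the future at time t
    phi_T_next: (N, k) characteristic features of the future at time t + 1
    ao:         (N,)   integer index of the action-observation pair at time t
    """
    N = phi_H.shape[0]
    mu_H = phi_H.mean(axis=0)
    Sigma_TH = phi_T.T @ phi_H / N
    Sigma_TaoH = []
    for j in range(n_pairs):
        mask = (ao == j).astype(float)[:, None]      # indicator I(o_t = o)
        Sigma_TaoH.append((phi_T_next * mask).T @ phi_H / N)
    return mu_H, Sigma_TH, Sigma_TaoH

# One-hot (indicator) features: the moments reduce to empirical frequencies.
eye = np.eye(2)
phi_H = eye[[0, 1, 1, 0]]
phi_T = eye[[1, 1, 0, 0]]
phi_T_next = eye[[0, 1, 1, 1]]
ao = np.array([0, 1, 1, 0])
mu_H, Sigma_TH, Sigma_TaoH = empirical_moments(phi_H, phi_T, phi_T_next, ao, 2)
```

With one-hot features, each entry of the estimated matrices is the fraction of samples in which the corresponding events co-occurred, which matches the remark that indicator features recover the probability matrices of Section 3.3.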

If we are trying to model a dynamical system with a small number of discrete actions and
observations, then the above approach to learning a PSR can be very effective. Characteristic
and indicative features are especially useful when we want to summarize a large set of tests and
histories (and, therefore, increase the likelihood that this set is core). However, if the number of
possible observations and actions is very large, and we only receive a few or no training points
for each action-observation pair, then the transition parameters $M_{ao}$ can be very difficult to learn.
When this is the case we use observation features.

4.3  Observation  Features  and  Bayes’  Rule 

In this section, we extend the above definitions and algorithms to handle a large observation set
$\mathcal{O}$ by extending our use of features to features of observations conditioned on actions. Suppose
that $|\mathcal{O}|$ is finite, but large enough that we cannot hope to see each observation more than a
small number of times. Then the representations of $\Sigma_{\mathcal{T},\mathcal{H}}$ and $\Sigma_{\mathcal{T}^+,ao,\mathcal{H}}$ in terms of PSR
parameters (Equations 4.1(a-d)) are still valid, but we cannot gather enough data to estimate $\Sigma_{\mathcal{T}^+,ao,\mathcal{H}}$
accurately unless we make additional assumptions.

¹ Again, with the same caveat about the SVD.

Therefore we generalize observations to observation features, a feature of the present that
is a linear combination of observations conditioned on taking a common action. In this way,
observation features are very similar to characteristic features: they must be designed so that the
expectation of observation features fully characterizes the probability distribution of observations
at time $t$ for each action. When obtaining empirical estimates of expectations of observation
features we must pay close attention to actions in a manner similar to estimating expectations of
characteristic features. We incorporate actions into the observation features by parameterizing
the features with actions. That is, we assume that we know $m$ features of observations
conditioned on actions, $\phi^{\mathcal{AO}}_{k,t}$ for $k = 1 \ldots m$ at each time $t$. We can decompose

$$
\phi^{\mathcal{AO}}_{k,t} = \Phi^{\mathcal{AO}}_{k,:}\, e_{a_t o_t}
\tag{4.3}
$$

where each element of the matrix $\Phi^{\mathcal{AO}} \in \mathbb{R}^{m \times |\mathcal{A}||\mathcal{O}|}$ contains the value of a feature given that we
take the action and observe the observation that indexes that element. The indicator vector $e_{a_t o_t}$
picks out the column of $\Phi^{\mathcal{AO}}$ corresponding to the particular action and observation at time $t$.
Entries of $\Phi^{\mathcal{AO}}$ that require a particular action are set to zero for other actions. This implies that
$\Phi^{\mathcal{AO}}$ is block diagonal.

To estimate observation features from data, we will pretend we get a sample $\phi^{\mathcal{AO}}_t$ in which
we evaluate all of our observation features, even if different elements of $\phi^{\mathcal{AO}}$ depend on different
actions. To get such a sample, we can pick a single action $a$ according to our policy given $h_t$ and
receive a sample observation $o$. We then compute the coordinates that we can, multiply by an
importance weight, and fill in zeros everywhere else:

$$
\phi^{\mathcal{AO}}_{k,t} = \frac{\Phi^{\mathcal{AO}}_{k,:}\, e_{a_t o_t}}{\mathbb{P}[a_t = a \mid h_t]}
\tag{4.4}
$$

The expected value of our sample vector is correct, since the probability of selecting our action
$\mathbb{P}[a_t = a \mid h_t]$ cancels with the weighting factor $1/\mathbb{P}[a_t = a \mid h_t]$.

Defining PSRs in terms of observation features is more involved than extending the definition
to characteristic and indicative features. In particular we have to be careful that extending to
features does not interfere with the Bayes’ rule update for the PSR state. In Section 4.3.1 we review
how PSRs implement Bayes’ rule and we provide a matrix algebra equation for implementing
Bayes’ rule. In Section 4.3.2 we show how to implement Bayes’ rule when we are dealing with
observation features. Finally, in Section 4.4 we define a PSR in terms of characteristic, indicative,
and observation features.


4.3.1  Bayes’  Rule  with  Discrete  Observations 

Recall from Section 3.2.2 that at each time step $t = 1, 2, \ldots$, a PSR predicts the joint probability
of observation $o$ and next state conditioned on action $a$

$$
\mathbb{P}[x(h_{t+1}), o_t = o \mid h_t, a_t = a] = M_{ao}\, x(h_t)
\tag{4.5}
$$

Then, the normalization vector $m_\infty$ marginalizes out the next state, giving just the probability of
observation $o_t$

$$
\mathbb{P}[o_t \mid h_t, a_t] = m_\infty^\top M_{ao}\, x(h_t)
\tag{4.6}
$$

and combining these two equations, the next PSR state is found by Bayes’ Rule

$$
x(h_{t+1}) = \frac{M_{ao}\, x(h_t)}{m_\infty^\top M_{ao}\, x(h_t)}
\tag{4.7}
$$

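As we will see shortly, it can be useful to write the Bayes’ rule update as a set of matrix
operations. If we define:

$$
P_{x^+,\mathcal{AO}|h_t} \equiv \begin{bmatrix} M_{a_1 o_1} x(h_t) & \cdots & M_{a_{|\mathcal{A}|} o_{|\mathcal{O}|}} x(h_t) \end{bmatrix}
\tag{4.8}
$$

$$
P_{\mathcal{AO},\mathcal{AO}|h_t} \equiv \begin{bmatrix}
m_\infty^\top M_{a_1 o_1} x(h_t) & & 0 \\
& \ddots & \\
0 & & m_\infty^\top M_{a_{|\mathcal{A}|} o_{|\mathcal{O}|}} x(h_t)
\end{bmatrix}
\tag{4.9}
$$

then we can write the next state as a linear function of an indicator vector of the current
observation:

$$
x(h_{t+1}) = P_{x^+,\mathcal{AO}|h_t}\, P^{-1}_{\mathcal{AO},\mathcal{AO}|h_t}\, e_{ao}
\tag{4.10}
$$

where $e_{ao}$ is an indicator vector that, when multiplied by $P_{x^+,\mathcal{AO}|h_t} P^{-1}_{\mathcal{AO},\mathcal{AO}|h_t}$, selects the column
containing the correct Bayes’ rule update $x(h_{t+1}) = M_{ao}\, x(h_t) / m_\infty^\top M_{ao}\, x(h_t)$ at time $t$.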

In  Section  4.3.2  below,  we  generalize  Equations  4.8,  4.9,  and  4.10  to  work  with  features  of 
observations  while  still  preserving  the  underlying  Bayes’  rule  for  updating  the  state  of  a  PSR. 
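
As a minimal numerical sketch of Equations 4.7 through 4.10, the following NumPy snippet checks that the matrix form of the update selects the same vector as the direct Bayes' rule update. The matrices `M`, the vector `m_inf`, and the state `x` are arbitrary positive numbers chosen for illustration (an assumption, not parameters of a learned PSR); the identity being checked is purely algebraic.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_ao = 3, 4  # state dimension and number of action-observation pairs (hypothetical sizes)

# Hypothetical parameters: arbitrary positive values, not a learned model.
M = rng.uniform(0.1, 1.0, size=(n_ao, d, d))  # M[k] plays the role of M_ao for pair k
m_inf = rng.uniform(0.1, 1.0, size=d)         # normalization vector
x = rng.uniform(0.1, 1.0, size=d)             # current state x(h_t)

def bayes_update_direct(ao):
    """Equation 4.7: M_ao x / (m_inf^T M_ao x)."""
    v = M[ao] @ x
    return v / (m_inf @ v)

def bayes_update_matrix(ao):
    """Equation 4.10: P_{x+,AO|h} P_{AO,AO|h}^{-1} e_ao."""
    P_x = np.stack([M[k] @ x for k in range(n_ao)], axis=1)   # Eq. 4.8, shape d x n_ao
    P_ao = np.diag([m_inf @ M[k] @ x for k in range(n_ao)])   # Eq. 4.9, diagonal
    e = np.zeros(n_ao)
    e[ao] = 1.0                                               # indicator of the observed pair
    return P_x @ np.linalg.inv(P_ao) @ e
```

Because $P_{AO,AO \mid h_t}$ is diagonal, the inverse only rescales each column of $P_{x^+,AO \mid h_t}$, and the indicator vector then picks out one rescaled column.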


4.3.2  Bayes’  Rule  with  Observation  Features 

In this section we generalize $P_{x^+,AO \mid h_t}$ to the cross covariance of next state and features of observations conditioned on history, $\Sigma_{x^+,AO \mid h_t}$, and we generalize $P_{AO,AO \mid h_t}$ to the covariance of features of observations conditioned on history, $\Sigma_{AO,AO \mid h_t}$. More specifically, with $\Phi^{AO}$ denoting the matrix whose columns are the feature vectors $\phi^{AO}_{ao}$:

$$\Sigma_{AO,AO \mid h_t} = \sum_{ao=1}^{|A| \times |O|} \phi^{AO}_{ao}\, \phi^{AO\top}_{ao}\, P[o_t = o \mid a_t = a,\, h_t] = \Phi^{AO}\, P_{AO,AO \mid h_t}\, \Phi^{AO\top} \tag{4.11}$$

Next we define the covariance matrix of observations and expected next state conditioned on history:

$$
\begin{aligned}
\Sigma_{x^+,AO \mid h_t}
&= \sum_{ao=1}^{|A| \times |O|} x(h_t, o_t = o, a_t = a)\, P[o_t = o \mid h_t, a_t = a]\, \phi^{AO\top}_{ao} \\
&= \sum_{ao=1}^{|A| \times |O|} \frac{M_{ao}\, x(h_t)}{m_\infty^\top M_{ao}\, x(h_t)}\, P[o_t = o \mid h_t, a_t = a]\, \phi^{AO\top}_{ao} \\
&= \sum_{ao=1}^{|A| \times |O|} M_{ao}\, x(h_t)\, \phi^{AO\top}_{ao} \\
&= P_{x^+,AO \mid h_t}\, \Phi^{AO\top}
\end{aligned}
\tag{4.12}
$$


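If $\operatorname{rank}(\Phi^{AO}) \geq |A| \times |O|$, then $\Phi^{AO\dagger} \Phi^{AO} = I$: the features uniquely characterize the observations conditioned on actions. This implies that the Bayes' rule update from Equation 4.10 is preserved:

$$
\begin{aligned}
\Sigma_{x^+,AO \mid h_t}\, \Sigma_{AO,AO \mid h_t}^{-1}\, \phi^{AO}_t
&= P_{x^+,AO \mid h_t}\, \Phi^{AO\top} \left(\Phi^{AO\top}\right)^{\dagger} P_{AO,AO \mid h_t}^{-1}\, \Phi^{AO\dagger}\, \Phi^{AO}\, e_{ao} \\
&= P_{x^+,AO \mid h_t}\, P_{AO,AO \mid h_t}^{-1}\, e_{ao}
\end{aligned}
\tag{4.13}
$$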


Conversely, if $\operatorname{rank}(\Phi^{AO}) < |A| \times |O|$, then Equation 4.13 is only approximately the Bayes' rule update. However, even when Bayes' rule is approximated, using features can be a huge advantage in practice: the fact that we never need to explicitly estimate or represent a probability density function of observations can outweigh small inaccuracies in the Bayes' rule update. We explore feature representations of Bayes' rule, and the practical advantages of this strategy, in more detail in Chapter 5, where we discuss and use the recent work on kernel Bayes' rule for learning non-parametric dynamical system models.

Given  the  equations  for  computing  Bayes’  rule  with  feature  covariances,  we  are  now  in  a 
position  to  define  PSRs  with  observation  features  in  such  a  way  that  we  can  perform  Bayes’  rule 
updates  for  filtering  based  only  on  observable  quantities. 
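
The claim behind Equation 4.13 can be checked numerically. The sketch below builds the feature covariances of Equations 4.11 and 4.12 from a random feature matrix `Phi` with more rows than action-observation pairs (so it has full column rank almost surely) and compares the feature-based update with the discrete update. All parameters are hypothetical, and because the feature covariance is singular when there are more features than pairs, the sketch uses a pseudo-inverse in place of the inverse written in Equation 4.13; that substitution is an assumption of this example.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_ao, m = 3, 4, 6  # state dim, action-observation pairs, feature dim (hypothetical)

# Hypothetical parameters: arbitrary positive values, not a learned model.
M = rng.uniform(0.1, 1.0, size=(n_ao, d, d))
m_inf = rng.uniform(0.1, 1.0, size=d)
x = rng.uniform(0.1, 1.0, size=d)
Phi = rng.normal(size=(m, n_ao))  # column k is the feature vector of pair k

P_x = np.stack([M[k] @ x for k in range(n_ao)], axis=1)  # Eq. 4.8
p = np.array([m_inf @ M[k] @ x for k in range(n_ao)])    # diagonal of Eq. 4.9

Sigma_x_ao = P_x @ Phi.T                 # Eq. 4.12
Sigma_ao_ao = Phi @ np.diag(p) @ Phi.T   # Eq. 4.11 (rank n_ao < m, hence singular)

def update_with_features(ao):
    """Left-hand side of Eq. 4.13, with a pseudo-inverse for the singular covariance."""
    phi_t = Phi[:, ao]
    return Sigma_x_ao @ np.linalg.pinv(Sigma_ao_ao, rcond=1e-10) @ phi_t

def update_discrete(ao):
    """Right-hand side of Eq. 4.13: the column of P_x P_ao^{-1} selected by e_ao."""
    return P_x[:, ao] / p[ao]
```

If `Phi` had fewer rows than columns, the two functions would generally disagree, which is the approximate regime described above.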


4.4  Defining  PSRs  with  Observation  Features 

Defining PSRs with observation features requires combining Bayes' rule from the previous section with our derivations for discovering a state space with characteristic and indicative features from Section 4.2. Specifically, we use the fact that $\Sigma_{\mathcal{T},\mathcal{H}} = \Phi^{\mathcal{T}} R\, \Sigma_{X,\mathcal{H}}$ from Equation 4.1c to argue that the PSR state space can be found by a spectral decomposition of an observable covariance matrix, and we use the Bayes' rule update to update the state at each time step $t$ using only moments of observable quantities.

Instead of defining one set of moments per action-observation pair, $\Sigma_{\mathcal{T}^+,ao,\mathcal{H}}$, we define one set of moments per feature: let index $k = 1 \ldots m$ range over features, and define a 3-mode tensor $\Sigma_{\mathcal{T}^+,AO,\mathcal{H}}$:

$$\left[\Sigma_{\mathcal{T}^+,AO,\mathcal{H}}\right]_{i,k,j} = E\left[\phi^{\mathcal{T}}_{i,t+1}\, \phi^{AO}_{k,t}\, \phi^{\mathcal{H}}_{j,t} \;\middle|\; \mathrm{do}(C^+)\right] \tag{4.14}$$

Also, instead of working with tensors directly, in the following derivations it will often be convenient to project a tensor like $\Sigma_{\mathcal{T}^+,AO,\mathcal{H}}$ to a matrix as follows:

$$\Sigma_{\mathcal{T}^+,AO}(\eta) = E\left[\phi^{\mathcal{T}}_{t+1}\, \langle \eta, \phi^{\mathcal{H}}_t \rangle\, \phi^{AO\top}_t \;\middle|\; \mathrm{do}(C^+)\right] \tag{4.15}$$

We can think of the matrix $\Sigma_{\mathcal{T}^+,AO}(\eta)$ as re-weighting a cross covariance of characteristic and indicative features, $\Sigma_{\mathcal{T}^+,AO} = E\left[\phi^{\mathcal{T}}_{t+1}\, \phi^{AO\top}_t \mid \mathrm{do}(C^+)\right]$, by a scalar $\langle \eta, \phi^{\mathcal{H}}_t \rangle$. The inner product $\langle \eta, \phi^{\mathcal{H}}_t \rangle$ contracts $\phi^{\mathcal{H}}_t$ down to a single dimension. In particular:


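$$
\begin{aligned}
\left[\Sigma_{\mathcal{T}^+,AO}(\eta)\right]_{i,j}
&= E\left[\phi^{\mathcal{T}}_{i,t+1} \langle \eta, \phi^{\mathcal{H}}_t \rangle \phi^{AO}_{j,t} \mid \mathrm{do}(C^+)\right] \\
\Sigma_{\mathcal{T}^+,AO}(\eta)
&= E\left[E\left[\phi^{\mathcal{T}}_{t+1} \langle \eta, \phi^{\mathcal{H}}_t \rangle \phi^{AO\top}_t \mid \mathrm{do}(C^+), h_t\right]\right] \\
&= E\left[E\left[\phi^{\mathcal{T}}_{t+1} \phi^{AO\top}_t \mid \mathrm{do}(C^+), h_t\right] \langle \eta, E[\phi^{\mathcal{H}}_t \mid h_t] \rangle\right] \\
&= E\left[\left(\sum_{ao=1}^{|A| \times |O|} E\left[\phi^{\mathcal{T}}_{t+1} \mid \mathrm{do}(C^+), h_t, o_t = o, a_t = a\right] \frac{\phi^{AO\top}_{ao}}{P[a_t = a \mid h_t]}\, P[o_t = o, a_t = a \mid h_t]\right) \langle \eta, E[\phi^{\mathcal{H}}_t \mid h_t] \rangle\right] \\
&= E\left[\left(\sum_{ao=1}^{|A| \times |O|} E\left[\phi^{\mathcal{T}}_{t+1} \mid \mathrm{do}(C^+), h_t, o_t = o, a_t = a\right] P[o_t = o \mid h_t, a_t = a]\, P[a_t = a \mid h_t]\, \frac{\phi^{AO\top}_{ao}}{P[a_t = a \mid h_t]}\right) \langle \eta, E[\phi^{\mathcal{H}}_t \mid h_t] \rangle\right] \\
&= E\left[\left(\sum_{ao=1}^{|A| \times |O|} E\left[\phi^{\mathcal{T}}_{t+1} \mid \mathrm{do}(C^+), h_t, o_t = o, a_t = a\right] P[o_t = o \mid h_t, a_t = a]\, \phi^{AO\top}_{ao}\right) \langle \eta, E[\phi^{\mathcal{H}}_t \mid h_t] \rangle\right] \\
&= E\left[\left(\sum_{ao=1}^{|A| \times |O|} \Phi^{\mathcal{T}} R\, x(h_t, o_t = o, a_t = a)\, P[o_t = o \mid h_t, a_t = a]\, \phi^{AO\top}_{ao}\right) \langle \eta, E[\phi^{\mathcal{H}}_t \mid h_t] \rangle\right] \\
&= \Phi^{\mathcal{T}} R\, E\left[\left(\sum_{ao=1}^{|A| \times |O|} x(h_t, o_t = o, a_t = a)\, P[o_t = o \mid h_t, a_t = a]\, \phi^{AO\top}_{ao}\right) \langle \eta, E[\phi^{\mathcal{H}}_t \mid h_t] \rangle\right] \\
&= \Phi^{\mathcal{T}} R\, E\left[\left(\sum_{ao=1}^{|A| \times |O|} \frac{M_{ao} x(h_t)}{m_\infty^\top M_{ao} x(h_t)}\, P[o_t = o \mid h_t, a_t = a]\, \phi^{AO\top}_{ao}\right) \langle \eta, E[\phi^{\mathcal{H}}_t \mid h_t] \rangle\right] \\
&= \Phi^{\mathcal{T}} R\, E\left[\left(\sum_{ao=1}^{|A| \times |O|} M_{ao} x(h_t)\, \phi^{AO\top}_{ao}\right) \langle \eta, E[\phi^{\mathcal{H}}_t \mid h_t] \rangle\right] \\
&= \Phi^{\mathcal{T}} R\, E\left[\left(\sum_{ao=1}^{|A| \times |O|} \sum_{l=1}^{|S|} [M_{ao}]_{\cdot,l}\, x_l(h_t)\, \phi^{AO\top}_{ao}\right) \langle \eta, E[\phi^{\mathcal{H}}_t \mid h_t] \rangle\right] \\
&= \Phi^{\mathcal{T}} R \sum_{l=1}^{|S|} \left(\sum_{ao=1}^{|A| \times |O|} [M_{ao}]_{\cdot,l}\, \phi^{AO\top}_{ao}\right) E\left[\langle \eta, E[\phi^{\mathcal{H}}_t \mid h_t] \rangle\, x_l(h_t)\right] \\
&= \Phi^{\mathcal{T}} R \sum_{l=1}^{|S|} \left(\sum_{ao=1}^{|A| \times |O|} [M_{ao}]_{\cdot,l}\, \phi^{AO\top}_{ao}\right) \left\langle \eta, E\left[E[\phi^{\mathcal{H}}_t \mid h_t]\, x_l(h_t)\right] \right\rangle \\
&= \Phi^{\mathcal{T}} R \sum_{l=1}^{|S|} \left(\sum_{ao=1}^{|A| \times |O|} [M_{ao}]_{\cdot,l}\, \phi^{AO\top}_{ao}\right) \left\langle \eta, [\Sigma_{X,\mathcal{H}}]_{l,\cdot}^\top \right\rangle \\
&= \Phi^{\mathcal{T}} R \sum_{l=1}^{|S|} \Sigma_{x^+,AO \mid x_l} \left\langle \eta, [\Sigma_{X,\mathcal{H}}]_{l,\cdot}^\top \right\rangle
\end{aligned}
\tag{4.16}
$$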


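where the covariance matrix $\Sigma_{x^+,AO \mid x_l} \stackrel{\mathrm{def}}{=} \sum_{ao=1}^{|A| \times |O|} [M_{ao}]_{\cdot,l}\, \phi^{AO\top}_{ao}$. From $\Sigma_{\mathcal{T}^+,AO}(\eta)$ we can compute a linear transform of $\Sigma_{x^+,AO \mid h_t}$:

$$
\begin{aligned}
U^\top \Sigma_{\mathcal{T}^+,AO}\left((U^\top \Sigma_{\mathcal{T},\mathcal{H}})^{\dagger} b_t\right)
&= U^\top \Sigma_{\mathcal{T}^+,AO}\left((U^\top \Sigma_{\mathcal{T},\mathcal{H}})^{\dagger} (U^\top \Phi^{\mathcal{T}} R)\, x(h_t)\right) \\
&= U^\top \Phi^{\mathcal{T}} R \sum_{l=1}^{|S|} \left(\sum_{ao=1}^{|A| \times |O|} [M_{ao}]_{\cdot,l}\, \phi^{AO\top}_{ao}\right) \left\langle (U^\top \Sigma_{\mathcal{T},\mathcal{H}})^{\dagger} (U^\top \Phi^{\mathcal{T}} R)\, x(h_t),\; [\Sigma_{X,\mathcal{H}}]_{l,\cdot}^\top \right\rangle \\
&= U^\top \Phi^{\mathcal{T}} R \sum_{l=1}^{|S|} \left(\sum_{ao=1}^{|A| \times |O|} [M_{ao}]_{\cdot,l}\, \phi^{AO\top}_{ao}\right) x_l(h_t) \\
&= U^\top \Phi^{\mathcal{T}} R \sum_{ao=1}^{|A| \times |O|} M_{ao}\, x(h_t)\, \phi^{AO\top}_{ao} \\
&= (U^\top \Phi^{\mathcal{T}} R)\, \Sigma_{x^+,AO \mid h_t}
\end{aligned}
\tag{4.17}
$$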

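Next we need to find $\Sigma_{AO,AO \mid h_t}$. We start by defining a 3-mode tensor $\Sigma_{AO,\mathcal{H},AO}$:

$$\left[\Sigma_{AO,\mathcal{H},AO}\right]_{i,l,j} = E\left[\phi^{AO}_{i,t} \cdot \phi^{\mathcal{H}}_{l,t} \cdot \phi^{AO}_{j,t}\right] \tag{4.18}$$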


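and its projection $\Sigma_{AO,AO}(\eta)$:

$$
\begin{aligned}
\left[\Sigma_{AO,AO}(\eta)\right]_{i,j}
&= E\left[\phi^{AO}_{i,t} \langle \eta, \phi^{\mathcal{H}}_t \rangle \phi^{AO}_{j,t}\right] \\
\Sigma_{AO,AO}(\eta)
&= E\left[E\left[\phi^{AO}_t \langle \eta, \phi^{\mathcal{H}}_t \rangle \phi^{AO\top}_t \mid h_t\right]\right] \\
&= E\left[E\left[\phi^{AO}_t \phi^{AO\top}_t \mid h_t\right] E\left[\langle \eta, \phi^{\mathcal{H}}_t \rangle \mid h_t\right]\right] \\
&= E\left[E\left[\phi^{AO}_t \phi^{AO\top}_t \mid h_t\right] \langle \eta, E[\phi^{\mathcal{H}}_t \mid h_t] \rangle\right] \\
&= E\left[\sum_{ao=1}^{|A| \times |O|} \frac{\phi^{AO}_{ao} \phi^{AO\top}_{ao}}{P[a_t = a \mid h_t]}\, P[o_t = o, a_t = a \mid h_t]\, \langle \eta, E[\phi^{\mathcal{H}}_t \mid h_t] \rangle\right] \\
&= E\left[\sum_{ao=1}^{|A| \times |O|} \frac{\phi^{AO}_{ao} \phi^{AO\top}_{ao}}{P[a_t = a \mid h_t]}\, P[o_t = o \mid a_t = a, h_t]\, P[a_t = a \mid h_t]\, \langle \eta, E[\phi^{\mathcal{H}}_t \mid h_t] \rangle\right] \\
&= \sum_{ao=1}^{|A| \times |O|} \phi^{AO}_{ao} \phi^{AO\top}_{ao}\, E\left[P[o_t = o \mid a_t = a, h_t]\, \langle \eta, E[\phi^{\mathcal{H}}_t \mid h_t] \rangle\right] \\
&= \sum_{ao=1}^{|A| \times |O|} \phi^{AO}_{ao} \phi^{AO\top}_{ao} \left\langle \eta, E\left[P[o_t = o \mid a_t = a, h_t]\, E[\phi^{\mathcal{H}}_t \mid h_t]\right] \right\rangle \\
&= \sum_{ao=1}^{|A| \times |O|} \phi^{AO}_{ao} \phi^{AO\top}_{ao} \left\langle \eta, E\left[E[\phi^{\mathcal{H}}_t \mid h_t]\, x(h_t)^\top\right] g_{ao} \right\rangle \\
&= \sum_{ao=1}^{|A| \times |O|} \phi^{AO}_{ao} \phi^{AO\top}_{ao} \left\langle \eta, \Sigma_{X,\mathcal{H}}^\top\, g_{ao} \right\rangle
\end{aligned}
\tag{4.19}
$$

Given $\Sigma_{AO,AO}(\eta)$ and the current state $b_t$, we can calculate $\Sigma_{AO,AO \mid h_t}$ as follows. We see that we can compute the probability $P[o_t = o \mid a_t = a, h_t]$ from the current state $b_t$:

$$
\begin{aligned}
g_{ao}^\top \Sigma_{X,\mathcal{H}} (U^\top \Sigma_{\mathcal{T},\mathcal{H}})^{\dagger} b_t
&= g_{ao}^\top \Sigma_{X,\mathcal{H}} (U^\top \Sigma_{\mathcal{T},\mathcal{H}})^{\dagger} (U^\top \Phi^{\mathcal{T}} R)\, x(h_t) \\
&= g_{ao}^\top (U^\top \Phi^{\mathcal{T}} R)^{-1} (U^\top \Phi^{\mathcal{T}} R)\, \Sigma_{X,\mathcal{H}} (U^\top \Sigma_{\mathcal{T},\mathcal{H}})^{\dagger} (U^\top \Phi^{\mathcal{T}} R)\, x(h_t) \\
&= g_{ao}^\top (U^\top \Phi^{\mathcal{T}} R)^{-1} (U^\top \Phi^{\mathcal{T}} R)\, x(h_t) \\
&= g_{ao}^\top x(h_t) \\
&= P[o_t = o \mid a_t = a, h_t]
\end{aligned}
\tag{4.20}
$$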

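Combining Equation 4.19 and Equation 4.20, we can find the conditional embedding of the covariance $\Sigma_{AO,AO \mid h_t}$:

$$
\begin{aligned}
\Sigma_{AO,AO}\left((U^\top \Sigma_{\mathcal{T},\mathcal{H}})^{\dagger} b_t\right)
&= \sum_{ao=1}^{|A| \times |O|} \phi^{AO}_{ao} \phi^{AO\top}_{ao} \left\langle (U^\top \Sigma_{\mathcal{T},\mathcal{H}})^{\dagger} b_t,\; \Sigma_{X,\mathcal{H}}^\top\, g_{ao} \right\rangle \\
&= \sum_{ao=1}^{|A| \times |O|} \phi^{AO}_{ao} \phi^{AO\top}_{ao}\, P[o_t = o \mid a_t = a, h_t] \\
&= \Sigma_{AO,AO \mid h_t}
\end{aligned}
\tag{4.21}
$$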

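Finally, we can implement the recursive Bayes' rule for PSRs with only observable features:

$$
\begin{aligned}
b_{t+1} &= (U^\top \Phi^{\mathcal{T}} R)\, x_{t+1} \\
&= (U^\top \Phi^{\mathcal{T}} R)\, \Sigma_{x^+,AO \mid h_t}\, \Sigma_{AO,AO \mid h_t}^{-1}\, \phi^{AO}_t \\
&= U^\top \Sigma_{\mathcal{T}^+,AO \mid h_t}\, \Sigma_{AO,AO \mid h_t}^{-1}\, \phi^{AO}_t \\
&= U^\top \Sigma_{\mathcal{T}^+,AO \mid h_t} \left(\Sigma_{AO,AO}\left((U^\top \Sigma_{\mathcal{T},\mathcal{H}})^{\dagger} b_t\right)\right)^{-1} \phi^{AO}_t \\
&= U^\top \Sigma_{\mathcal{T}^+,AO}\left((U^\top \Sigma_{\mathcal{T},\mathcal{H}})^{\dagger} b_t\right) \left(\Sigma_{AO,AO}\left((U^\top \Sigma_{\mathcal{T},\mathcal{H}})^{\dagger} b_t\right)\right)^{-1} \phi^{AO}_t
\end{aligned}
\tag{4.22}
$$

Importantly, note that in the final line above all quantities are observable.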

We  can  now  define  our  spectral  learning  algorithm  for  PSRs  with  many  observations.  Unlike 
the  few-observation  case,  to  save  space  and  time,  we  do  not  precompute  and  store  Bao  for  each 
action-observation  pair.  Instead,  we  compute  the  Bayes’  rule  state  update  as  needed  at  tracking 
time  via  Equation  4.22. 

4.4.1  The  Moment  Spectral  Learning  Algorithm  for  PSRs 

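The equations from the previous section yield a simple spectral learning algorithm. Our algorithm will estimate the moments $\Sigma_{\mathcal{T},\mathcal{H}}$, $\Sigma_{\mathcal{T}^+,AO,\mathcal{H}}$, and $\Sigma_{AO,\mathcal{H},AO}$ from data and then use Equation 4.22 to update state. Just as before, once we have computed the estimated moments $\hat{\Sigma}_{\mathcal{T},\mathcal{H}}$, $\hat{\Sigma}_{\mathcal{T}^+,AO,\mathcal{H}}$, and $\hat{\Sigma}_{AO,\mathcal{H},AO}$, we can generate $\hat{U}$ by singular value decomposition of $\hat{\Sigma}_{\mathcal{T},\mathcal{H}}$. For reference, we summarize the learning algorithm here:

1. Compute empirical estimates of the moments: $\hat{\Sigma}_{\mathcal{T},\mathcal{H}}$, $\hat{\Sigma}_{\mathcal{T}^+,AO,\mathcal{H}}$, and $\hat{\Sigma}_{AO,\mathcal{H},AO}$.

2. Compute an SVD of $\hat{\Sigma}_{\mathcal{T},\mathcal{H}}$ to find $\hat{U}$, the matrix of left singular vectors corresponding to the $d$ largest singular values.

3. Compute the initial state: $\hat{b}_* = \hat{U}^\top \hat{\Sigma}_{\mathcal{T},\mathcal{H}}\, e$.

4. At each time step update state with generalized Bayes' rule:

$$\hat{b}_{t+1} = \hat{U}^\top \hat{\Sigma}_{\mathcal{T}^+,AO}\left((\hat{U}^\top \hat{\Sigma}_{\mathcal{T},\mathcal{H}})^{\dagger} \hat{b}_t\right) \left(\hat{\Sigma}_{AO,AO}\left((\hat{U}^\top \hat{\Sigma}_{\mathcal{T},\mathcal{H}})^{\dagger} \hat{b}_t\right)\right)^{-1} \phi^{AO}_t$$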
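
The four steps above can be sketched in NumPy as follows. This is an illustrative sketch rather than the implementation used in the experiments: the tensor layouts (`S_TAH[i, k, j]` for test, observation, and history features; `S_AHA[i, l, j]` for observation, history, and observation features), the small `ridge` term added before solving, and the synthetic feature data are all assumptions made for the example, and the importance weighting of Equation 4.4 is omitted.

```python
import numpy as np

def top_left_singular_vectors(S_TH, d):
    """Step 2: left singular vectors of S_TH for the d largest singular values."""
    U, _, _ = np.linalg.svd(S_TH, full_matrices=False)
    return U[:, :d]

def filter_step(U, S_TH, S_TAH, S_AHA, b, phi_ao, ridge=1e-6):
    """Step 4 (Eq. 4.22); the ridge term is an assumption added for numerical stability."""
    eta = np.linalg.pinv(U.T @ S_TH) @ b           # (U^T S_TH)^+ b_t, a history-feature vector
    S_TA = np.einsum('ikj,j->ik', S_TAH, eta)      # contract the history mode (Eq. 4.15)
    S_AA = np.einsum('ilj,l->ij', S_AHA, eta)      # contract the history mode (Eq. 4.19)
    reg = ridge * np.eye(S_AA.shape[0])
    return U.T @ S_TA @ np.linalg.solve(S_AA + reg, phi_ao)

# Synthetic feature data (hypothetical): N time steps of feature vectors.
rng = np.random.default_rng(0)
N, n_T, n_AO, n_H, d = 50, 5, 4, 6, 3
phi_T = rng.uniform(size=(N, n_T))       # characteristic features at time t
phi_T_next = rng.uniform(size=(N, n_T))  # characteristic features at time t+1
phi_AO = rng.uniform(size=(N, n_AO))     # observation features
phi_H = rng.uniform(size=(N, n_H))       # indicative (history) features

# Step 1: empirical moments as averages of outer products.
S_TH = phi_T.T @ phi_H / N
S_TAH = np.einsum('ni,nk,nj->ikj', phi_T_next, phi_AO, phi_H) / N
S_AHA = np.einsum('ni,nl,nj->ilj', phi_AO, phi_H, phi_AO) / N

U = top_left_singular_vectors(S_TH, d)                   # step 2
b0 = U.T @ S_TH @ np.ones(n_H)                           # step 3, taking e as the all-ones vector (an assumption)
b1 = filter_step(U, S_TH, S_TAH, S_AHA, b0, phi_AO[0])   # step 4
```

Contracting the tensors at filtering time, instead of storing one operator per action-observation pair, is the space-saving choice described before Section 4.4.1.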

As we include more data in our averages, the law of large numbers guarantees that our estimates $\hat{\Sigma}_{\mathcal{T},\mathcal{H}}$, $\hat{\Sigma}_{\mathcal{T}^+,AO,\mathcal{H}}$, and $\hat{\Sigma}_{AO,\mathcal{H},AO}$ converge to the true matrices $\Sigma_{\mathcal{T},\mathcal{H}}$, $\Sigma_{\mathcal{T}^+,AO,\mathcal{H}}$, and $\Sigma_{AO,\mathcal{H},AO}$. So by continuity of the formulas above, if our system is truly a PSR of finite rank, and our features are sufficiently expressive, our estimates $\hat{b}_*$ and Bayes' rule converge to the true parameters up to a linear transform; that is, our learning algorithm is consistent.²


²Again, with the same caveat about the SVD.

Chapter 5

Hilbert  Space  Embeddings  of  Predictive 
State  Representations 


The  standard  spectral  algorithm  for  PSRs  derived  in  Chapter  3  is  formulated  for  discrete  random 
variables,  and,  in  Chapter  4,  an  efficient  spectral  learning  algorithm  was  derived  for  large  sets  of 
actions  and  observations.  The  approach  in  Chapter  4  utilized  finite  sets  of  features  of  observa¬ 
tions,  tests,  and  histories.  In  this  chapter  we  will  explore  this  concept  further  and  examine  how 
to  extend  these  ideas  to  general  cases  with  continuous  and  structured  variables. 

In  this  chapter,  we  fully  generalize  PSRs  to  (potentially)  continuous  observation  and  action 
sets  using  a  recent  concept  called  Hilbert  space  embeddings  of  distributions  [94,  103].  The 
essence  of  our  method  is  to  represent  distributions  of  tests,  histories,  observations,  and  actions, 
as points in (possibly) infinite-dimensional Hilbert spaces. During filtering we update these
points  entirely  in  the  Hilbert  spaces  using  a  kernel  version  of  Bayes’  rule. 

The  proposed  method  is  similar  to  recent  work  that  applies  kernel  methods  to  dynamical 
system  modeling  and  reinforcement  learning,  which  we  summarize  here.  Song  et  al.  [99]  pro¬ 
posed  a  nonparametric  approach  to  learning  HMM  representations  in  RKHS.  The  resulting  non- 
parametric  dynamical  system  model,  called  Hilbert  Space  Embeddings  of  Hidden  Markov  Mod¬ 
els  (HSE-HMMs),  proved  to  be  an  impressive  alternative  to  competing  dynamical  system  models 
on  a  number  of  experimental  benchmarks.  Despite  these  successes,  the  HSE-HMM  has  two  ma¬ 
jor  limitations:  first,  the  update  rule  for  the  HMM  only  approximates  the  state  update  (up  to 
a  fixed  multiplicative  scalar)  instead  of  performing  direct  Bayesian  inference  in  Hilbert  space. 
Second,  the  model  lacks  the  capacity  to  reason  about  actions,  which  limits  the  scope  of  systems 
that  can  be  modeled.  Our  model  can  be  viewed  as  an  extension  of  HSE-HMMs  that  adds  control 
inputs and updates state using a kernelized version of Bayes’ rule.

Grünewalder et al. [37] proposed a nonparametric approach to learning transition dynamics in Markov decision processes (MDPs) by representing the MDP transitions as conditional distributions in RKHS. This work was extended by Nishiyama et al. [73], who developed a nonparametric approach for policy learning in POMDPs that represents distributions over the states, observations, and actions as embeddings in RKHSs. The resulting model is called Hilbert Space Embeddings of POMDPs (HSE-POMDPs). Distributions over states given the observations are obtained by applying the kernel Bayes' rule to these distribution embeddings. Policies and value functions are then defined on the feature space over states. Critically, the authors only provided results for fully observable models, where the training data includes labels for the true latent states. By contrast, our algorithm only requires access to an (unlabeled) sequence of actions and observations.


5.1  Hilbert  Space  Embeddings 

At  a  high  level,  we  will  be  extending  the  idea  of  feature  representations  from  Chapter  4  from 
finite  feature  spaces  to  function  spaces.  This  requires  some  additional  concepts  and  tools  for 
reasoning  about  statistics  in  reproducing  kernel  Hilbert  spaces  (RKHSs),  which  we  review  here. 
However,  once  these  concepts  are  introduced,  the  observable  representation  of  PSRs  in  RKHSs 
can  be  seen  as  identical  to  the  representation  in  Chapter  3  or  Chapter  4.  (For  example,  if  we  use 
the  discrete/delta  kernel  on  a  finite  set  of  observations,  we  recover  the  algorithm  of  Chapter  3; 
and  if  we  use  a  finite-dimensional  RKHS,  we  recover  the  algorithm  of  Chapter  4.)  The  sole 
practical  difference  in  the  learning  algorithm  is  the  use  of  the  kernel  trick  to  contend  with  the 
fact  that  the  characteristic,  indicative,  and  observation  features  are  no  longer  required  to  be  finite¬ 
dimensional.  We  will  derive  the  algorithm  using  infinite-dimensional  operators  first,  since  that 
representation  best  shows  the  analogy  to  Chapters  3-4.  Then,  starting  in  Section  5.3  we  will 
show  how  to  implement  the  algorithm  using  only  finite-dimensional  Gram  matrices. 


5.1.1  Mean  Maps 

Let $\mathcal{F}$ be a reproducing kernel Hilbert space (RKHS) associated with kernel $K_X(x, x') = \langle \phi^X(x), \phi^X(x') \rangle_{\mathcal{F}}$. Then for all functions $f \in \mathcal{F}$ and $x \in \mathcal{X}$ we have the reproducing property: $\langle f, \phi^X(x) \rangle_{\mathcal{F}} = f(x)$, i.e. the evaluation of function $f$ at $x$ can be written as an inner product. Examples of kernels include the Gaussian RBF kernel $K_X(x, x') = \exp(-s \|x - x'\|^2)$; however, kernel functions have also been defined on strings, graphs, and other structured objects.

Let $\mathcal{P}$ be the set of probability distributions on $\mathcal{X}$, and $X$ the random variable with distribution $P \in \mathcal{P}$. Following Smola et al. [94], we define the mean map of $P \in \mathcal{P}$ into RKHS $\mathcal{F}$ to be $\mu_X = E[\phi^X(X)]$, also called the Hilbert space embedding of $P$ or simply the mean map. For all $f \in \mathcal{F}$, $E[f(X)] = \langle f, \mu_X \rangle_{\mathcal{F}}$ by the reproducing property and linearity of expectations. We may think of the mean map by analogy with a mean vector in a finite-dimensional space: if $\mathcal{F} = \mathbb{R}^d$, then $f \in \mathbb{R}^d$ is some fixed vector, $X$ is a random vector defined on $\mathbb{R}^d$ with mean $\mu_X$, and $E[\langle f, X \rangle_{\mathcal{F}}] = f^\top \mu_X$.

A characteristic RKHS is one for which the mean map is injective: that is, each distribution has a unique embedding [103]. This property holds for many commonly used kernels (e.g. the Gaussian and Laplace kernels when $\mathcal{X} = \mathbb{R}^d$).

As a special case of the mean map, the marginal probability vector of a discrete variable $X$ is a Hilbert space embedding, i.e. $(P[X = i])_{i=1}^{M} = \mu_X$. Here the kernel is the delta function $K_X(x, x') = \mathbb{I}(x = x')$, and the feature map is the 1-of-$M$ representation for discrete variables.


Given $T$ i.i.d. observations $\{x_t\}_{t=1}^{T}$, an estimate of the mean map is straightforward:

$$\hat{\mu}_X = \frac{1}{T} \sum_{t=1}^{T} \phi^X(x_t) = \frac{1}{T} \Upsilon^X 1_T \tag{5.1}$$

where $\Upsilon^X = (\phi^X(x_1), \ldots, \phi^X(x_T))$ is the operator which maps the $t$th unit vector of $\mathbb{R}^T$ to $\phi^X(x_t)$, which we can think of as an arrangement of feature maps into columns. Assume $K_X(x, x')$ is bounded; then with probability $1 - \delta$, this estimate computes an approximation with an error of $\|\hat{\mu}_X - \mu_X\|_{\mathcal{F}} = O(T^{-1/2} (\log(1/\delta))^{1/2})$ [94].
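
To make the empirical mean map concrete, here is a small NumPy sketch. It never forms the (possibly infinite-dimensional) feature vectors: the embedding is evaluated at a query point through kernel evaluations, using $\langle \hat{\mu}_X, \phi^X(z) \rangle_{\mathcal{F}} = \frac{1}{T} \sum_t K_X(x_t, z)$. The function names and the bandwidth parameter `s` are illustrative choices, not part of the original text.

```python
import numpy as np

def gaussian_kernel(X, Z, s=1.0):
    """Gram matrix K[i, j] = exp(-s * ||X[i] - Z[j]||^2) for row-wise samples."""
    sq_dists = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-s * sq_dists)

def mean_embedding_at(X, Z, s=1.0):
    """Evaluate <mu_hat_X, phi(z)> = (1/T) sum_t K(x_t, z) for each row z of Z."""
    return gaussian_kernel(X, Z, s).mean(axis=0)

def discrete_mean_map(x, M):
    """Delta-kernel special case: average of 1-of-M feature vectors."""
    one_hot = np.eye(M)[x]        # T x M, one row per sample
    return one_hot.mean(axis=0)   # empirical marginal probability vector
```

With the delta kernel the embedding is exactly the vector of empirical frequencies, matching the special case described in Section 5.1.1.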

5.1.2  Covariance  Operators 

The covariance operator is a generalization of the covariance matrix. Given a joint distribution $P[X, Y]$ over two variables $X$ on $\mathcal{X}$ and $Y$ on $\mathcal{Y}$,¹ the uncentered covariance operator $C_{XY}$ is [6]

$$C_{XY} = E_{XY}\left[\phi^X(X) \otimes \phi^Y(Y)\right], \tag{5.2}$$

where $\otimes$ denotes tensor product. Alternatively, $C_{XY}$ can simply be viewed as an embedding of joint distribution $P[X, Y]$ using joint feature map $\psi(x, y) \stackrel{\mathrm{def}}{=} \phi^X(x) \otimes \phi^Y(y)$ (in tensor product RKHS $\mathcal{F} \otimes \mathcal{G}$). Given two functions, $f \in \mathcal{F}$ and $g \in \mathcal{G}$, their cross-covariance is computed as:

$$\langle f, C_{XY}\, g \rangle_{\mathcal{F}} \quad \text{or equivalently} \quad \langle f \otimes g, C_{XY} \rangle_{\mathcal{F} \otimes \mathcal{G}}. \tag{5.3}$$

For discrete variables $X$ and $Y$ with delta kernels on both domains, the covariance operator will coincide with the joint probability table, i.e. $(P[X = i, Y = j])_{i,j=1}^{M} = C_{XY}$.

Given $T$ pairs of i.i.d. observations $\{(x_t, y_t)\}_{t=1}^{T}$, we denote $\Upsilon^X = (\phi^X(x_1), \ldots, \phi^X(x_T))$ and $\Upsilon^Y = (\phi^Y(y_1), \ldots, \phi^Y(y_T))$. The covariance operator $C_{XY}$ can then be estimated as

$$\hat{C}_{XY} = \frac{1}{T} \Upsilon^X \Upsilon^{Y*} \tag{5.4}$$

where $*$ denotes the adjoint. Assume $K_X(x, x')$ and $K_Y(y, y')$ are bounded. With probability $1 - \delta$, this estimate computes an approximation with an error of $\|\hat{C}_{XY} - C_{XY}\|_{\mathcal{F} \otimes \mathcal{G}} = O(T^{-1/2} (\log(1/\delta))^{1/2})$ [94].

5.1.3  Conditional  Embedding  Operators 

Based on covariance operators, Song et al. [98] define a conditional embedding operator which allows us to compute conditional expectations $E[\phi^Y(Y) \mid X]$ as linear operators in RKHS. We define the conditional embedding operator $W_{Y|X} : \mathcal{F} \mapsto \mathcal{G}$

$$W_{Y|X} = C_{YX}\, C_{XX}^{-1} \tag{5.5}$$

such that for all $g \in \mathcal{G}$

$$E[g(Y) \mid x] = \langle g, W_{Y|X}\, \phi^X(x) \rangle_{\mathcal{G}}$$

given several assumptions. First we assume $E[g(Y) \mid X = \cdot] \in \mathcal{F}$ for all $g \in \mathcal{G}$. Next we define the operator $S : \mathcal{G} \mapsto \mathcal{F}$ such that $S g = E[g(Y) \mid X = \cdot] \in \mathcal{F}$, which implies $C_{XY}\, g = C_{XX}\, E[g(Y) \mid X = \cdot]$ [34]. If we further assume that $C_{XX}$ is injective, we can write $C_{XX}^{-1} C_{XY}\, g = E[g(Y) \mid X = \cdot]$, meaning that $S = C_{XX}^{-1} C_{XY}$.² Finally we assume that $S$ is bounded, which implies it has an adjoint $S^*$. By the reproducing property,

$$
\begin{aligned}
E[g(Y) \mid X = x] &= \langle E[g(Y) \mid X = \cdot], \phi^X(x) \rangle_{\mathcal{F}} \\
&= \langle S g, \phi^X(x) \rangle_{\mathcal{F}} \\
&= \langle g, S^* \phi^X(x) \rangle_{\mathcal{G}} \\
&= \langle g, C_{YX}\, C_{XX}^{-1}\, \phi^X(x) \rangle_{\mathcal{G}}
\end{aligned}
$$

Therefore, there exists a $W_{Y|X} = C_{YX}\, C_{XX}^{-1}$.

¹We assume a kernel $K_Y(y, y') = \langle \phi^Y(y), \phi^Y(y') \rangle_{\mathcal{G}}$ is defined on $\mathcal{Y}$ with associated RKHS $\mathcal{G}$.

For discrete variables with delta kernels, conditional embedding operators correspond exactly to conditional probability tables (CPTs), i.e. $(P[Y = i \mid X = j])_{i,j=1}^{M} = W_{Y|X}$, and each individual conditional embedding corresponds to one column of the CPT, i.e. $(P[Y = i \mid X = x])_{i=1}^{M} = \mu_{Y|x}$.

Given $T$ i.i.d. pairs $\{(x_t, y_t)\}_{t=1}^{T}$ from $P[X, Y]$, the conditional embedding operator can be estimated as

$$\hat{W}_{Y|X} = \frac{\Upsilon^Y \Upsilon^{X*}}{T} \left(\frac{\Upsilon^X \Upsilon^{X*}}{T} + \lambda I\right)^{-1} = \Upsilon^Y \left(G_{X,X} + \lambda T I\right)^{-1} \Upsilon^{X*} \tag{5.6}$$

where we have defined the Gram matrix $G_{X,X} = \Upsilon^{X*} \Upsilon^X$ with $(i, j)$th entry $K_X(x_i, x_j)$. The regularization parameter $\lambda$ helps to avoid overfitting and to ensure invertibility (and thus that the resulting operator is well defined). Under strong assumptions, Song et al. [100] show that $\hat{W}_{Y|X}$ converges to its population counterpart in probability. However, convergence should also be guaranteed under weaker assumptions; see [34] for details.
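
The two forms of Equation 5.6 are equal by a standard push-through identity, and the sketch below checks this numerically for explicit finite-dimensional features (equivalently, linear kernels). Here `Xf` and `Yf` hold feature vectors as columns and stand in for $\Upsilon^X$ and $\Upsilon^Y$; the function names and the random data are assumptions made for illustration.

```python
import numpy as np

def embedding_operator_cov(Yf, Xf, lam):
    """First form of Eq. 5.6: (Y X^T / T) (X X^T / T + lam I)^{-1}."""
    T = Xf.shape[1]
    C_yx = Yf @ Xf.T / T
    C_xx = Xf @ Xf.T / T
    return C_yx @ np.linalg.inv(C_xx + lam * np.eye(Xf.shape[0]))

def embedding_operator_gram(Yf, Xf, lam):
    """Second form of Eq. 5.6: Y (G + lam T I)^{-1} X^T, using only the T x T Gram matrix."""
    T = Xf.shape[1]
    G = Xf.T @ Xf
    return Yf @ np.linalg.solve(G + lam * T * np.eye(T), Xf.T)

def conditional_weights(G, k_x, lam):
    """Sample weights alpha with mu_hat_{Y|x} = sum_t alpha_t phi^Y(y_t)."""
    T = G.shape[0]
    return np.linalg.solve(G + lam * T * np.eye(T), k_x)
```

The Gram form is the one that survives when features are infinite-dimensional: a conditional expectation is then estimated as $\sum_t \alpha_t\, g(y_t)$ with the weights returned by `conditional_weights`, where `k_x` holds the kernel evaluations $K_X(x_t, x)$.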

5.1.4  Kernel  Bayes’  Rule 

We are now in a position to define the kernel mean map implementation of Bayes' rule (which we call Kernel Bayes' Rule (KBR)). In deriving the kernel realization of Bayes' rule we use conditional embedding operators to obtain the kernel mean representation of a conditional joint probability $P[X, Y \mid Z = z]$. Given Hilbert spaces $\mathcal{F}$, $\mathcal{G}$, and $\mathcal{H}$ corresponding to the embedding of $x$, $y$, and $z$ respectively, this can be represented as an RKHS operator $C_{XY|z} \stackrel{\mathrm{def}}{=} E[\phi^X(X) \otimes \phi^Y(Y) \mid z]$. We define the conditional covariance operator $\mathcal{H} \mapsto \mathcal{F} \otimes \mathcal{G}$ given several assumptions [34]:

$$C_{XY|z} = C_{(XY)Z}\, C_{ZZ}^{-1}\, \phi^Z(z) \tag{5.7}$$

²Note that we do not require $C_{XX}^{-1}$ to be bounded: in fact, it won't be bounded if the eigenspectrum is countable, which will be the case for a Gaussian kernel. But the composite operator $C_{XX}^{-1} C_{XY}$ can still be bounded if the singular values of $C_{XY}$ decay fast enough, and its singular vectors are somewhat aligned with those of $C_{XX}$.

Here the covariance operator $C_{(XY)Z}$ is defined by the random variable $((X, Y), Z)$.

Our goal is to perform a Bayes' rule update $P[X \mid Y = y, Z = z] = \frac{P[X, Y = y \mid Z = z]}{P[Y = y \mid Z = z]}$ in RKHS. We accomplish this by defining kernel Bayes' rule in terms of conditional covariance operators [34]:

$$\mu_{X|y,z} = C_{XY|z}\, C_{YY|z}^{-1}\, \phi^Y(y) \tag{5.8}$$

For discrete variables with delta kernels, KBR corresponds exactly to the discrete Bayes' rule update, i.e. $(P[X = i \mid Y = j, Z = z])_{i,j=1}^{M} = W_{X|Y,z}$, and each individual conditional embedding corresponds to one column of the CPT, i.e. $(P[X = i \mid Y = y, Z = z])_{i=1}^{M} = \mu_{X|y,z}$.

Given $T$ i.i.d. triples $\{(x_t, y_t, z_t)\}_{t=1}^{T}$ from $P[X, Y, Z]$, we denote $\Upsilon^X = (\phi^X(x_1), \ldots, \phi^X(x_T))$, $\Upsilon^Y = (\phi^Y(y_1), \ldots, \phi^Y(y_T))$, and $\Upsilon^Z = (\phi^Z(z_1), \ldots, \phi^Z(z_T))$. The Bayes' rule update can be estimated by first estimating the conditional embedding of the covariance operators $\hat{C}_{XY|z}$ and $\hat{C}_{YY|z}$ via the equations in Section 5.1.3, and then estimating $\hat{\mu}_{X|y,z} = \hat{C}_{XY|z} (\hat{C}_{YY|z} + \lambda I)^{-1} \phi^Y(y)$. We can express this equation with Gram matrices as follows [34]:

$$\Lambda_z = \operatorname{diag}\left((G_{Z,Z} + \lambda T I)^{-1}\, \Upsilon^{Z*} \phi^Z(z)\right) \tag{5.9}$$

$$\hat{\mu}_{X|y,z} = \Upsilon^X \Lambda_z G_{Y,Y} \left((\Lambda_z G_{Y,Y})^2 + \lambda T I\right)^{-1} \Lambda_z \Upsilon^{Y*} \phi^Y(y) \tag{5.10}$$

where we have defined the Gram matrix $G_{Y,Y} = \Upsilon^{Y*} \Upsilon^Y$ with $(i, j)$th entry $K_Y(y_i, y_j)$ and Gram matrix $G_{Z,Z} = \Upsilon^{Z*} \Upsilon^Z$ with $(i, j)$th entry $K_Z(z_i, z_j)$. The function $\operatorname{diag}(\cdot)$ takes a vector as input and returns a diagonal matrix. The diagonal elements of $\Lambda_z$ weight the samples, thereby encoding the conditioning information from $z$. Since the weights may be negative, we use a different type of regularization than the standard Tikhonov regularization used in Equation 5.6 [34].
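
A minimal NumPy sketch of the Gram-matrix form in Equations 5.9 and 5.10 is given below. It returns the vector of sample weights $w$ such that $\hat{\mu}_{X|y,z} = \Upsilon^X w$. The regularization constants `reg_z` and `reg_y` are passed in directly (any scaling by $T$ is left to the caller), and the function name is hypothetical; both are assumptions of this sketch.

```python
import numpy as np

def kbr_weights(G_Y, G_Z, k_y, k_z, reg_z, reg_y):
    """Sample weights w with mu_hat_{X|y,z} = Upsilon^X w (Eqs. 5.9 and 5.10).

    G_Y, G_Z : T x T Gram matrices on the y and z samples
    k_y, k_z : kernel evaluations K_Y(y_t, y) and K_Z(z_t, z) at the query point
    """
    T = G_Z.shape[0]
    lam = np.linalg.solve(G_Z + reg_z * np.eye(T), k_z)   # diagonal of Lambda_z (Eq. 5.9)
    LG = lam[:, None] * G_Y                               # Lambda_z G_{Y,Y}
    inner = np.linalg.solve(LG @ LG + reg_y * np.eye(T), lam * k_y)
    return LG @ inner                                     # Eq. 5.10 without Upsilon^X
```

With delta kernels on discrete data and small regularization, applying these weights to 1-of-$M$ features of the $x$ samples approximately recovers the empirical conditional distribution $P[X \mid Y = y, Z = z]$, in line with the discrete special case noted in Section 5.1.4.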


5.2  Predictive  Representations  in  RKHS 

We will focus on learning the conditional embedding operator $W_{\mathcal{T}^O | \mathcal{T}^A, h_t}$ for the predictive density of a core set of tests $P[\tau^O_t \mid \tau^A_t, h_t]$ of a PSR and updating this conditional embedding operator given a new action and observation using kernel Bayes' rule. In this chapter we assume that our data was generated by a blind policy, where future actions do not rely on future observations. This means that the PSR state is the conditional probability of observation sequences given action sequences (we do not need to worry about interventions). This is an expressive representation: we can model near-arbitrary continuous-valued dynamical systems, limited only by the existence of the conditional embedding operator $W_{\mathcal{T}^O | \mathcal{T}^A, h_t}$ (and therefore the assumptions given in Section 5.1.3).

5.2.1  Conditional  Predictive  Representations

We begin by defining kernels on core test observations $\tau^O$, core test actions $\tau^A$, observations $o$, actions $a$, and histories $h$, where each inner product is taken in the RKHS associated with that kernel:

$$K_{\mathcal{T}^O}(\tau^O, \tau'^O) = \langle \phi^{\mathcal{T}^O}(\tau^O), \phi^{\mathcal{T}^O}(\tau'^O) \rangle \tag{5.11}$$

$$K_{\mathcal{T}^A}(\tau^A, \tau'^A) = \langle \phi^{\mathcal{T}^A}(\tau^A), \phi^{\mathcal{T}^A}(\tau'^A) \rangle \tag{5.12}$$

$$K_{\mathcal{O}}(o, o') = \langle \phi^{\mathcal{O}}(o), \phi^{\mathcal{O}}(o') \rangle \tag{5.13}$$

$$K_{\mathcal{A}}(a, a') = \langle \phi^{\mathcal{A}}(a), \phi^{\mathcal{A}}(a') \rangle \tag{5.14}$$

$$K_{\mathcal{H}}(h, h') = \langle \phi^{\mathcal{H}}(h), \phi^{\mathcal{H}}(h') \rangle \tag{5.15}$$

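Predictive State in RKHS. Given the covariance operator for embedded action sequences $C_{\mathcal{T}^A,\mathcal{T}^A} \stackrel{\mathrm{def}}{=} E[\phi^{\mathcal{T}^A}(\tau^A) \otimes \phi^{\mathcal{T}^A}(\tau^A)]$ and the cross covariance operator between embedded observation sequences and action sequences $C_{\mathcal{T}^O,\mathcal{T}^A} \stackrel{\mathrm{def}}{=} E[\phi^{\mathcal{T}^O}(\tau^O) \otimes \phi^{\mathcal{T}^A}(\tau^A)]$, the Hilbert space embedding of tests conditioned on a single action sequence is the conditional embedding detailed in Section 5.1.3:

$$\mu_{\mathcal{T}^O | \tau^A} = C_{\mathcal{T}^O,\mathcal{T}^A}\, C_{\mathcal{T}^A,\mathcal{T}^A}^{-1}\, \phi^{\mathcal{T}^A}(\tau^A) \tag{5.16}$$

However, a predictive representation of the future is a collection of distributions of tests, one per action sequence. We can represent the PSR state as a conditional embedding operator $W_{\mathcal{T}^O | \mathcal{T}^A}$:

$$W_{\mathcal{T}^O | \mathcal{T}^A} = C_{\mathcal{T}^O,\mathcal{T}^A}\, C_{\mathcal{T}^A,\mathcal{T}^A}^{-1} \tag{5.17}$$

From this operator we can find the mean embedding for any action sequence as $\mu_{\mathcal{T}^O | \tau^A} = W_{\mathcal{T}^O | \mathcal{T}^A}\, \phi^{\mathcal{T}^A}(\tau^A)$ by Equation 5.16.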

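If we want to compute the PSR state given a particular history $h_t$, we define several new covariance operators and apply KBR. We first define $C_{\mathcal{T}^O,\mathcal{T}^A,\mathcal{H}} \stackrel{\mathrm{def}}{=} E[\phi^{\mathcal{T}^O}(\tau^O_t) \otimes \phi^{\mathcal{T}^A}(\tau^A_t) \otimes \phi^{\mathcal{H}}(h_t)]$, $C_{\mathcal{T}^A,\mathcal{T}^A,\mathcal{H}} \stackrel{\mathrm{def}}{=} E[\phi^{\mathcal{T}^A}(\tau^A_t) \otimes \phi^{\mathcal{T}^A}(\tau^A_t) \otimes \phi^{\mathcal{H}}(h_t)]$, and $C_{\mathcal{H},\mathcal{H}} \stackrel{\mathrm{def}}{=} E[\phi^{\mathcal{H}}(h_t) \otimes \phi^{\mathcal{H}}(h_t)]$. We then compute the conditional covariance operators

$$C_{\mathcal{T}^O,\mathcal{T}^A | h_t} = C_{(\mathcal{T}^O \mathcal{T}^A)\mathcal{H}}\, C_{\mathcal{H},\mathcal{H}}^{-1}\, \phi^{\mathcal{H}}(h_t) \tag{5.18}$$

$$C_{\mathcal{T}^A,\mathcal{T}^A | h_t} = C_{(\mathcal{T}^A \mathcal{T}^A)\mathcal{H}}\, C_{\mathcal{H},\mathcal{H}}^{-1}\, \phi^{\mathcal{H}}(h_t) \tag{5.19}$$

Finally, we can compute the conditional embedding operator $W_{\mathcal{T}^O | \mathcal{T}^A, h_t}$:

$$W_{\mathcal{T}^O | \mathcal{T}^A, h_t} = C_{\mathcal{T}^O,\mathcal{T}^A | h_t}\, C_{\mathcal{T}^A,\mathcal{T}^A | h_t}^{-1} \tag{5.20}$$

The conditional embedding operator $W_{\mathcal{T}^O | \mathcal{T}^A, h_t}$ is the kernel mean embedding of $P[\tau^O_t \mid \tau^A_t, h_t]$, i.e. it is the Hilbert space embedding of a PSR state. It is a predictive state since it uniquely encodes the predictive density of future observations in the dynamical system given future action sequences. In Section 5.2.2 we detail how to find a PSR state in RKHS given a previous state and an action-observation pair, instead of needing to use the entire history.

In Sections 5.2.2 and 5.3 below we break the PSR state $W_{\mathcal{T}^O | \mathcal{T}^A, h_t}$ down into its constituent parts, the conditional covariance operators $C_{\mathcal{T}^O,\mathcal{T}^A | h_t}$ and $C_{\mathcal{T}^A,\mathcal{T}^A | h_t}$. We then update these two operators instead of $W_{\mathcal{T}^O | \mathcal{T}^A, h_t}$ directly. This representation is at least as expressive since we can always reconstruct $W_{\mathcal{T}^O | \mathcal{T}^A, h_t} = C_{\mathcal{T}^O,\mathcal{T}^A | h_t}\, C_{\mathcal{T}^A,\mathcal{T}^A | h_t}^{-1}$.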

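Conditional Observations. Here we use the rules for conditional embedding from Section 5.1.3 to show how to calculate the conditional embedding of an observation distribution given a state and an action. We start by showing how to embed the observation distribution of a stateless process. We define $C_{\mathcal{A},\mathcal{A}} \stackrel{\mathrm{def}}{=} E[\phi^{\mathcal{A}}(a_t) \otimes \phi^{\mathcal{A}}(a_t)]$ and $C_{\mathcal{O},\mathcal{A}} \stackrel{\mathrm{def}}{=} E[\phi^{\mathcal{O}}(o_t) \otimes \phi^{\mathcal{A}}(a_t)]$. The conditional embedding of the observation distribution is therefore:

$$\mu_{\mathcal{O} | a_t} = C_{\mathcal{O},\mathcal{A}}\, C_{\mathcal{A},\mathcal{A}}^{-1}\, \phi^{\mathcal{A}}(a_t) \tag{5.21}$$

As with the Hilbert space predictive state, we actually want to calculate the embedding of the PDF of the observation conditioned on the current action and history $h_t$. Therefore we calculate the following conditional embeddings of covariance operators

$$C_{\mathcal{O},\mathcal{A} | h_t} = C_{(\mathcal{O}\mathcal{A})\mathcal{H}}\, C_{\mathcal{H},\mathcal{H}}^{-1}\, \phi^{\mathcal{H}}(h_t) \tag{5.22}$$

$$C_{\mathcal{A},\mathcal{A} | h_t} = C_{(\mathcal{A}\mathcal{A})\mathcal{H}}\, C_{\mathcal{H},\mathcal{H}}^{-1}\, \phi^{\mathcal{H}}(h_t) \tag{5.23}$$

The embedding of the conditional observation given an action and history is given by KBR:

$$\mu_{\mathcal{O} | h_t, a_t} = C_{\mathcal{O},\mathcal{A} | h_t}\, C_{\mathcal{A},\mathcal{A} | h_t}^{-1}\, \phi^{\mathcal{A}}(a_t) \tag{5.24}$$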


5.2.2  The  State  Update 

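To implement the Bayes' rule state update we start with a PSR state at time $t$, then take an action, receive an observation, and incorporate this information into the PSR to get the state at time $t + 1$. In what follows, we need to be careful about independence between different random variables. For example, if we evaluate $\phi^{\mathcal{T}^O}(\tau^O_t)$ and $\phi^{\mathcal{T}^O}(\tau^O_{t+1})$ at the same time step, the realizations will not be independent, even conditioned on the state of the process. If we wanted independent realizations of $\phi^{\mathcal{T}^O}(\tau^O_t)$ and $\phi^{\mathcal{T}^O}(\tau^O_{t+1})$, we would have to assume the ability to reset the system to a desired state. Below we will take care that we only ask for operators which we can estimate without resets. (If we didn't have to obey this constraint, the algorithm would become somewhat simpler, but the need for resets would constrain applicability.) Therefore, in addition to the covariance operators $C_{\mathcal{T}^O,\mathcal{T}^A,\mathcal{H}}$ and $C_{\mathcal{T}^A,\mathcal{T}^A,\mathcal{H}}$ defined above, we define several additional covariance operators $C_{\mathcal{T}^{O+},\mathcal{T}^{A+},\mathcal{O},\mathcal{A},\mathcal{H}}$, $C_{\mathcal{O},\mathcal{O},\mathcal{A},\mathcal{H}}$, and $C_{\mathcal{A},\mathcal{A},\mathcal{H}}$, which are needed to update the state.

$$C_{\mathcal{T}^O,\mathcal{T}^A,\mathcal{H}} \stackrel{\mathrm{def}}{=} E\left[\phi^{\mathcal{T}^O}(\tau^O_t) \otimes \phi^{\mathcal{T}^A}(\tau^A_t) \otimes \phi^{\mathcal{H}}(h_t)\right] \tag{5.25}$$

$$C_{\mathcal{T}^A,\mathcal{T}^A,\mathcal{H}} \stackrel{\mathrm{def}}{=} E\left[\phi^{\mathcal{T}^A}(\tau^A_t) \otimes \phi^{\mathcal{T}^A}(\tau^A_t) \otimes \phi^{\mathcal{H}}(h_t)\right] \tag{5.26}$$

$$C_{\mathcal{T}^{O+},\mathcal{T}^{A+},\mathcal{O},\mathcal{A},\mathcal{H}} \stackrel{\mathrm{def}}{=} E\left[\phi^{\mathcal{T}^O}(\tau^O_{t+1}) \otimes \phi^{\mathcal{T}^A}(\tau^A_{t+1}) \otimes \phi^{\mathcal{O}}(o_t) \otimes \phi^{\mathcal{A}}(a_t) \otimes \phi^{\mathcal{H}}(h_t)\right] \tag{5.27}$$

$$C_{\mathcal{O},\mathcal{O},\mathcal{A},\mathcal{H}} \stackrel{\mathrm{def}}{=} E\left[\phi^{\mathcal{O}}(o_t) \otimes \phi^{\mathcal{O}}(o_t) \otimes \phi^{\mathcal{A}}(a_t) \otimes \phi^{\mathcal{H}}(h_t)\right] \tag{5.28}$$

$$C_{\mathcal{A},\mathcal{A},\mathcal{H}} \stackrel{\mathrm{def}}{=} E\left[\phi^{\mathcal{A}}(a_t) \otimes \phi^{\mathcal{A}}(a_t) \otimes \phi^{\mathcal{H}}(h_t)\right] \tag{5.29}$$

Unlike Equations 5.18, 5.19, 5.22, and 5.23, we don't want to compute the conditional embeddings of covariance operators $C_{\mathcal{T}^O,\mathcal{T}^A | h_t}$, $C_{\mathcal{T}^A,\mathcal{T}^A | h_t}$, $C_{\mathcal{O},\mathcal{A} | h_t}$, and $C_{\mathcal{A},\mathcal{A} | h_t}$ directly from histories, but rather from the previous state. Since $C_{\mathcal{T}^O,\mathcal{T}^A | h_t}$ is the characteristic embedding of the probability distribution of all tests, we assume that we can compute the embedding of the probability distribution of any subset of observations, actions, or tests with a conditional covariance operator from this embedding. For example,

$$
\begin{aligned}
C_{\mathcal{A},\mathcal{A} | h_t} &= E\left[\phi^{\mathcal{A}}(a_t) \otimes \phi^{\mathcal{A}}(a_t) \mid h_t\right] \\
&= E\left[E\left[\phi^{\mathcal{A}}(a_t) \otimes \phi^{\mathcal{A}}(a_t) \mid \phi^{\mathcal{T}^O}(\tau^O_t) \otimes \phi^{\mathcal{T}^A}(\tau^A_t)\right] \mid h_t\right] \\
&= E\left[W_{\mathcal{A},\mathcal{A} | \mathcal{T}^O,\mathcal{T}^A} \left(\phi^{\mathcal{T}^O}(\tau^O_t) \otimes \phi^{\mathcal{T}^A}(\tau^A_t)\right) \mid h_t\right] \\
&= W_{\mathcal{A},\mathcal{A} | \mathcal{T}^O,\mathcal{T}^A}\, E\left[\phi^{\mathcal{T}^O}(\tau^O_t) \otimes \phi^{\mathcal{T}^A}(\tau^A_t) \mid h_t\right] \\
&= W_{\mathcal{A},\mathcal{A} | \mathcal{T}^O,\mathcal{T}^A}\, C_{\mathcal{T}^O,\mathcal{T}^A | h_t}
\end{aligned}
$$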

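Specifically, we assume that the operators $\mathcal{W}_{\mathcal{T}^{O+},\mathcal{T}^{A+},O,A \mid \mathcal{T}^O,\mathcal{T}^A}$, $\mathcal{W}_{\mathcal{T}^A,\mathcal{T}^A \mid \mathcal{T}^O,\mathcal{T}^A}$, $\mathcal{W}_{O,O,A \mid \mathcal{T}^O,\mathcal{T}^A}$,
and $\mathcal{W}_{A,A \mid \mathcal{T}^O,\mathcal{T}^A}$ exist and that we can pseudo-invert $\mathcal{C}_{(\mathcal{T}^O,\mathcal{T}^A)\mathcal{H}}$ such that $\mathcal{C}_{(\mathcal{T}^O,\mathcal{T}^A)\mathcal{H}}\, \mathcal{C}^{\dagger}_{(\mathcal{T}^O,\mathcal{T}^A)\mathcal{H}} = I$.
We see that we can then compute these operators from the covariances given in Equations 5.25-5.29:

$$
\begin{aligned}
\mathcal{C}_{(\mathcal{T}^{O+},\mathcal{T}^{A+},O,A)\mathcal{H}}\, \mathcal{C}^{\dagger}_{(\mathcal{T}^O,\mathcal{T}^A)\mathcal{H}}
&= \mathcal{W}_{\mathcal{T}^{O+},\mathcal{T}^{A+},O,A \mid \mathcal{H}}\, \mathcal{C}_{\mathcal{H},\mathcal{H}}\, \mathcal{C}^{\dagger}_{(\mathcal{T}^O,\mathcal{T}^A)\mathcal{H}}\\
&= \mathcal{W}_{\mathcal{T}^{O+},\mathcal{T}^{A+},O,A \mid \mathcal{T}^O,\mathcal{T}^A}\, \mathcal{W}_{\mathcal{T}^O,\mathcal{T}^A \mid \mathcal{H}}\, \mathcal{C}_{\mathcal{H},\mathcal{H}}\, \mathcal{C}^{\dagger}_{(\mathcal{T}^O,\mathcal{T}^A)\mathcal{H}}\\
&= \mathcal{W}_{\mathcal{T}^{O+},\mathcal{T}^{A+},O,A \mid \mathcal{T}^O,\mathcal{T}^A}\, \mathcal{C}_{(\mathcal{T}^O,\mathcal{T}^A)\mathcal{H}}\, \mathcal{C}^{\dagger}_{(\mathcal{T}^O,\mathcal{T}^A)\mathcal{H}}\\
&= \mathcal{W}_{\mathcal{T}^{O+},\mathcal{T}^{A+},O,A \mid \mathcal{T}^O,\mathcal{T}^A} && (5.30)\\[6pt]
\mathcal{C}_{(\mathcal{T}^A,\mathcal{T}^A)\mathcal{H}}\, \mathcal{C}^{\dagger}_{(\mathcal{T}^O,\mathcal{T}^A)\mathcal{H}}
&= \mathcal{W}_{\mathcal{T}^A,\mathcal{T}^A \mid \mathcal{H}}\, \mathcal{C}_{\mathcal{H},\mathcal{H}}\, \mathcal{C}^{\dagger}_{(\mathcal{T}^O,\mathcal{T}^A)\mathcal{H}}\\
&= \mathcal{W}_{\mathcal{T}^A,\mathcal{T}^A \mid \mathcal{T}^O,\mathcal{T}^A}\, \mathcal{W}_{\mathcal{T}^O,\mathcal{T}^A \mid \mathcal{H}}\, \mathcal{C}_{\mathcal{H},\mathcal{H}}\, \mathcal{C}^{\dagger}_{(\mathcal{T}^O,\mathcal{T}^A)\mathcal{H}}\\
&= \mathcal{W}_{\mathcal{T}^A,\mathcal{T}^A \mid \mathcal{T}^O,\mathcal{T}^A}\, \mathcal{C}_{(\mathcal{T}^O,\mathcal{T}^A)\mathcal{H}}\, \mathcal{C}^{\dagger}_{(\mathcal{T}^O,\mathcal{T}^A)\mathcal{H}}\\
&= \mathcal{W}_{\mathcal{T}^A,\mathcal{T}^A \mid \mathcal{T}^O,\mathcal{T}^A} && (5.31)\\[6pt]
\mathcal{C}_{(O,O,A)\mathcal{H}}\, \mathcal{C}^{\dagger}_{(\mathcal{T}^O,\mathcal{T}^A)\mathcal{H}}
&= \mathcal{W}_{O,O,A \mid \mathcal{H}}\, \mathcal{C}_{\mathcal{H},\mathcal{H}}\, \mathcal{C}^{\dagger}_{(\mathcal{T}^O,\mathcal{T}^A)\mathcal{H}}\\
&= \mathcal{W}_{O,O,A \mid \mathcal{T}^O,\mathcal{T}^A}\, \mathcal{W}_{\mathcal{T}^O,\mathcal{T}^A \mid \mathcal{H}}\, \mathcal{C}_{\mathcal{H},\mathcal{H}}\, \mathcal{C}^{\dagger}_{(\mathcal{T}^O,\mathcal{T}^A)\mathcal{H}}\\
&= \mathcal{W}_{O,O,A \mid \mathcal{T}^O,\mathcal{T}^A}\, \mathcal{C}_{(\mathcal{T}^O,\mathcal{T}^A)\mathcal{H}}\, \mathcal{C}^{\dagger}_{(\mathcal{T}^O,\mathcal{T}^A)\mathcal{H}}\\
&= \mathcal{W}_{O,O,A \mid \mathcal{T}^O,\mathcal{T}^A} && (5.32)\\[6pt]
\mathcal{C}_{(A,A)\mathcal{H}}\, \mathcal{C}^{\dagger}_{(\mathcal{T}^O,\mathcal{T}^A)\mathcal{H}}
&= \mathcal{W}_{A,A \mid \mathcal{H}}\, \mathcal{C}_{\mathcal{H},\mathcal{H}}\, \mathcal{C}^{\dagger}_{(\mathcal{T}^O,\mathcal{T}^A)\mathcal{H}}\\
&= \mathcal{W}_{A,A \mid \mathcal{T}^O,\mathcal{T}^A}\, \mathcal{W}_{\mathcal{T}^O,\mathcal{T}^A \mid \mathcal{H}}\, \mathcal{C}_{\mathcal{H},\mathcal{H}}\, \mathcal{C}^{\dagger}_{(\mathcal{T}^O,\mathcal{T}^A)\mathcal{H}}\\
&= \mathcal{W}_{A,A \mid \mathcal{T}^O,\mathcal{T}^A}\, \mathcal{C}_{(\mathcal{T}^O,\mathcal{T}^A)\mathcal{H}}\, \mathcal{C}^{\dagger}_{(\mathcal{T}^O,\mathcal{T}^A)\mathcal{H}}\\
&= \mathcal{W}_{A,A \mid \mathcal{T}^O,\mathcal{T}^A} && (5.33)
\end{aligned}
$$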
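The pattern shared by Equations 5.30-5.33 can be checked numerically in a finite-dimensional setting. The following is a minimal sketch, assuming explicit finite feature vectors in place of RKHS feature maps; the names `X`, `Y`, `H`, and `W_true` are hypothetical stand-ins for test features, target features, history features, and the conditional operator. When the target features are an exact linear image of the test features and the test/history covariance has full row rank, multiplying by its pseudo-inverse recovers the operator.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, q, r = 500, 4, 6, 3   # samples, test-feature dim, history-feature dim, target dim

H = rng.normal(size=(q, n))                              # history features (one column per sample)
X = rng.normal(size=(p, q)) @ H + 0.1 * rng.normal(size=(p, n))  # test features, correlated with H
W_true = rng.normal(size=(r, p))                         # hypothetical conditional operator
Y = W_true @ X                                           # targets are an exact linear image of X

C_XH = X @ H.T / n    # finite analogue of C_{(T^O,T^A)H}
C_YH = Y @ H.T / n    # finite analogue of, e.g., C_{(A,A)H}

# C_XH has full row rank here, so C_XH @ pinv(C_XH) = I and W is recovered.
W_est = C_YH @ np.linalg.pinv(C_XH)
print(np.max(np.abs(W_est - W_true)))
```

In this sketch the recovery is exact up to floating-point error only because `Y` depends linearly on `X`; with noisy targets the same product gives a regression estimate instead.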
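Updating State with Bayes’ Rule

Using the above equations, we can find covariance operators conditioned on history at time $t$,
based on the covariance $\mathcal{C}_{\mathcal{T}^O,\mathcal{T}^A \mid h_t}$ at time $t$:

$$
\begin{aligned}
\mathcal{C}_{\mathcal{T}^{O+},\mathcal{T}^{A+},O,A \mid h_t} &= \mathcal{W}_{\mathcal{T}^{O+},\mathcal{T}^{A+},O,A \mid \mathcal{T}^O,\mathcal{T}^A}\, \mathcal{C}_{\mathcal{T}^O,\mathcal{T}^A \mid h_t} && (5.34)\\
\mathcal{C}_{\mathcal{T}^A,\mathcal{T}^A \mid h_t} &= \mathcal{W}_{\mathcal{T}^A,\mathcal{T}^A \mid \mathcal{T}^O,\mathcal{T}^A}\, \mathcal{C}_{\mathcal{T}^O,\mathcal{T}^A \mid h_t} && (5.35)\\
\mathcal{C}_{O,O,A \mid h_t} &= \mathcal{W}_{O,O,A \mid \mathcal{T}^O,\mathcal{T}^A}\, \mathcal{C}_{\mathcal{T}^O,\mathcal{T}^A \mid h_t} && (5.36)\\
\mathcal{C}_{A,A \mid h_t} &= \mathcal{W}_{A,A \mid \mathcal{T}^O,\mathcal{T}^A}\, \mathcal{C}_{\mathcal{T}^O,\mathcal{T}^A \mid h_t} && (5.37)
\end{aligned}
$$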

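Finally, in order to update our state, we execute two instances of KBR. First, when we choose an
action $a_t$, we update:

$$
\begin{aligned}
\mathcal{C}_{\mathcal{T}^{O+},\mathcal{T}^{A+},O \mid h_t,a_t} &= \mathcal{C}_{(\mathcal{T}^{O+},\mathcal{T}^{A+},O)A \mid h_t}\, \mathcal{C}^{-1}_{A,A \mid h_t}\, \phi^{A}(a_t) && (5.38)\\
\mathcal{C}_{O,O \mid h_t,a_t} &= \mathcal{C}_{(O,O)A \mid h_t}\, \mathcal{C}^{-1}_{A,A \mid h_t}\, \phi^{A}(a_t) && (5.39)
\end{aligned}
$$

Next, when we receive the observation generated by the system, $o_t$, we incorporate it to calculate
the joint conditional covariance:

$$
\mathcal{C}_{\mathcal{T}^{O+},\mathcal{T}^{A+} \mid h_t,a_t,o_t} = \mathcal{C}_{(\mathcal{T}^{O+},\mathcal{T}^{A+})O \mid h_t,a_t}\, \mathcal{C}^{-1}_{O,O \mid h_t,a_t}\, \phi^{O}(o_t) \qquad (5.40)
$$

Finally, the joint conditional covariance at time $t+1$ is identified as $\mathcal{C}_{\mathcal{T}^{O+},\mathcal{T}^{A+} \mid h_t,a_t,o_t}$:

$$
\mathcal{C}_{\mathcal{T}^O,\mathcal{T}^A \mid h_{t+1}} = \mathcal{C}_{\mathcal{T}^{O+},\mathcal{T}^{A+} \mid h_t,a_t,o_t} \qquad (5.41)
$$

The PSR state can now be computed from Equation 5.35 and Equation 5.20:

$$
\begin{aligned}
\mathcal{C}_{\mathcal{T}^A,\mathcal{T}^A \mid h_{t+1}} &= \mathcal{W}_{\mathcal{T}^A,\mathcal{T}^A \mid \mathcal{T}^O,\mathcal{T}^A}\, \mathcal{C}_{\mathcal{T}^O,\mathcal{T}^A \mid h_{t+1}} && (5.42)\\
\mathcal{W}_{\mathcal{T}^O \mid \mathcal{T}^A,h_{t+1}} &= \mathcal{C}_{\mathcal{T}^O,\mathcal{T}^A \mid h_{t+1}}\, \mathcal{C}^{-1}_{\mathcal{T}^A,\mathcal{T}^A \mid h_{t+1}} && (5.43)
\end{aligned}
$$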

5.2.3  A  Minimal  State  Space 

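Instead of working with mean embeddings in generic RKHSs, it is sometimes possible to embed
distributions in a finite-dimensional subspace of the RKHS. For discrete action-observation PSRs
with delta kernels, such a subspace corresponds to a core set of tests (see Chapter 3 for details).
In the more general case, we can factor the conditional embedding of the covariance operator
$\mathcal{C}_{\mathcal{T}^O,\mathcal{T}^A \mid h_t}$ into a finite-dimensional operator $\mathcal{C}_{\mathcal{X}^O,\mathcal{X}^A \mid h_t}$ and a conditional covariance operator $\mathcal{U}$.
Here $\mathcal{X} = (\mathcal{X}^O, \mathcal{X}^A)$ is a finite set of linear combinations of tests, which we choose to make the
factorization possible.

$$
\mathcal{C}_{\mathcal{T}^O,\mathcal{T}^A \mid h_t} = \mathcal{U}\, \mathcal{C}_{\mathcal{X}^O,\mathcal{X}^A \mid h_t} \qquad (5.44)
$$

Analogous to the discrete case we can find $\mathcal{U}$ by performing a ‘thin’ SVD of the covariance operator
$\mathcal{C}_{(\mathcal{T}^O,\mathcal{T}^A)\mathcal{H}}$ and taking the top $d$ singular vectors as $\mathcal{U}$. (So we are choosing $\mathcal{X}$ so that $\mathcal{U}^*\mathcal{U} = I$.)
Instead of using the infinite-dimensional operators $\mathcal{W}_{\mathcal{T}^{O+},\mathcal{T}^{A+},O,A \mid \mathcal{T}^O,\mathcal{T}^A}$, $\mathcal{W}_{\mathcal{T}^A,\mathcal{T}^A \mid \mathcal{T}^O,\mathcal{T}^A}$,
$\mathcal{W}_{O,O,A \mid \mathcal{T}^O,\mathcal{T}^A}$, and $\mathcal{W}_{A,A \mid \mathcal{T}^O,\mathcal{T}^A}$ in conjunction with KBR to update state, we use finite-dimensional
operators $\mathcal{W}_{\mathcal{X}^{O+},\mathcal{X}^{A+},O,A \mid \mathcal{X}^O,\mathcal{X}^A}$, $\mathcal{W}_{\mathcal{T}^A,\mathcal{T}^A \mid \mathcal{X}^O,\mathcal{X}^A}$, $\mathcal{W}_{O,O,A \mid \mathcal{X}^O,\mathcal{X}^A}$, and $\mathcal{W}_{A,A \mid \mathcal{X}^O,\mathcal{X}^A}$,
which we can find as follows:

$$
\begin{aligned}
\mathcal{U}^*\mathcal{C}_{(\mathcal{T}^{O+},\mathcal{T}^{A+},O,A)\mathcal{H}}\left(\mathcal{U}^*\mathcal{C}_{(\mathcal{T}^O,\mathcal{T}^A)\mathcal{H}}\right)^{\dagger}
&= \mathcal{U}^*\mathcal{W}_{\mathcal{T}^{O+},\mathcal{T}^{A+},O,A \mid \mathcal{T}^O,\mathcal{T}^A}\, \mathcal{C}_{(\mathcal{T}^O,\mathcal{T}^A)\mathcal{H}}\left(\mathcal{U}^*\mathcal{C}_{(\mathcal{T}^O,\mathcal{T}^A)\mathcal{H}}\right)^{\dagger}\\
&= \mathcal{U}^*\mathcal{W}_{\mathcal{T}^{O+},\mathcal{T}^{A+},O,A \mid \mathcal{T}^O,\mathcal{T}^A}\, \mathcal{U}\, \mathcal{C}_{(\mathcal{X}^O,\mathcal{X}^A)\mathcal{H}}\left(\mathcal{C}_{(\mathcal{X}^O,\mathcal{X}^A)\mathcal{H}}\right)^{\dagger}\\
&= \mathcal{U}^*\mathcal{W}_{\mathcal{T}^{O+},\mathcal{T}^{A+},O,A \mid \mathcal{T}^O,\mathcal{T}^A}\, \mathcal{U}\\
&= \mathcal{W}_{\mathcal{X}^{O+},\mathcal{X}^{A+},O,A \mid \mathcal{X}^O,\mathcal{X}^A} && (5.45)
\end{aligned}
$$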

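The last line follows from the fact that

$$
\begin{aligned}
\mathcal{U}^*\mathcal{C}_{\mathcal{T}^{O+},\mathcal{T}^{A+},O,A,\mathcal{T}^O,\mathcal{T}^A}\, \mathcal{C}^{\dagger}_{\mathcal{T}^O,\mathcal{T}^A}\, \mathcal{U}
&= \mathcal{C}_{\mathcal{X}^{O+},\mathcal{X}^{A+},O,A,\mathcal{X}^O,\mathcal{X}^A}\, \mathcal{U}^*\left(\mathcal{U}\, \mathcal{C}_{\mathcal{X}^O,\mathcal{X}^A}\, \mathcal{U}^*\right)^{\dagger}\mathcal{U}\\
&= \mathcal{C}_{\mathcal{X}^{O+},\mathcal{X}^{A+},O,A,\mathcal{X}^O,\mathcal{X}^A}\, \mathcal{C}^{-1}_{\mathcal{X}^O,\mathcal{X}^A}\\
&= \mathcal{W}_{\mathcal{X}^{O+},\mathcal{X}^{A+},O,A \mid \mathcal{X}^O,\mathcal{X}^A}
\end{aligned}
$$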

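Continuing, we see that:

$$
\begin{aligned}
\mathcal{C}_{(\mathcal{T}^A,\mathcal{T}^A)\mathcal{H}}\left(\mathcal{U}^*\mathcal{C}_{(\mathcal{T}^O,\mathcal{T}^A)\mathcal{H}}\right)^{\dagger}
&= \mathcal{W}_{\mathcal{T}^A,\mathcal{T}^A \mid \mathcal{T}^O,\mathcal{T}^A}\, \mathcal{C}_{(\mathcal{T}^O,\mathcal{T}^A)\mathcal{H}}\left(\mathcal{U}^*\mathcal{C}_{(\mathcal{T}^O,\mathcal{T}^A)\mathcal{H}}\right)^{\dagger}\\
&= \mathcal{W}_{\mathcal{T}^A,\mathcal{T}^A \mid \mathcal{T}^O,\mathcal{T}^A}\, \mathcal{U}\, \mathcal{C}_{(\mathcal{X}^O,\mathcal{X}^A)\mathcal{H}}\left(\mathcal{C}_{(\mathcal{X}^O,\mathcal{X}^A)\mathcal{H}}\right)^{\dagger}\\
&= \mathcal{W}_{\mathcal{T}^A,\mathcal{T}^A \mid \mathcal{T}^O,\mathcal{T}^A}\, \mathcal{U}\\
&= \mathcal{W}_{\mathcal{T}^A,\mathcal{T}^A \mid \mathcal{X}^O,\mathcal{X}^A} && (5.46)\\[6pt]
\mathcal{C}_{(O,O,A)\mathcal{H}}\left(\mathcal{U}^*\mathcal{C}_{(\mathcal{T}^O,\mathcal{T}^A)\mathcal{H}}\right)^{\dagger}
&= \mathcal{W}_{O,O,A \mid \mathcal{T}^O,\mathcal{T}^A}\, \mathcal{C}_{(\mathcal{T}^O,\mathcal{T}^A)\mathcal{H}}\left(\mathcal{U}^*\mathcal{C}_{(\mathcal{T}^O,\mathcal{T}^A)\mathcal{H}}\right)^{\dagger}\\
&= \mathcal{W}_{O,O,A \mid \mathcal{T}^O,\mathcal{T}^A}\, \mathcal{U}\, \mathcal{C}_{(\mathcal{X}^O,\mathcal{X}^A)\mathcal{H}}\left(\mathcal{C}_{(\mathcal{X}^O,\mathcal{X}^A)\mathcal{H}}\right)^{\dagger}\\
&= \mathcal{W}_{O,O,A \mid \mathcal{T}^O,\mathcal{T}^A}\, \mathcal{U}\\
&= \mathcal{W}_{O,O,A \mid \mathcal{X}^O,\mathcal{X}^A} && (5.47)\\[6pt]
\mathcal{C}_{(A,A)\mathcal{H}}\left(\mathcal{U}^*\mathcal{C}_{(\mathcal{T}^O,\mathcal{T}^A)\mathcal{H}}\right)^{\dagger}
&= \mathcal{W}_{A,A \mid \mathcal{T}^O,\mathcal{T}^A}\, \mathcal{C}_{(\mathcal{T}^O,\mathcal{T}^A)\mathcal{H}}\left(\mathcal{U}^*\mathcal{C}_{(\mathcal{T}^O,\mathcal{T}^A)\mathcal{H}}\right)^{\dagger}\\
&= \mathcal{W}_{A,A \mid \mathcal{T}^O,\mathcal{T}^A}\, \mathcal{U}\, \mathcal{C}_{(\mathcal{X}^O,\mathcal{X}^A)\mathcal{H}}\left(\mathcal{C}_{(\mathcal{X}^O,\mathcal{X}^A)\mathcal{H}}\right)^{\dagger}\\
&= \mathcal{W}_{A,A \mid \mathcal{T}^O,\mathcal{T}^A}\, \mathcal{U}\\
&= \mathcal{W}_{A,A \mid \mathcal{X}^O,\mathcal{X}^A} && (5.48)
\end{aligned}
$$

Similar to Equations 5.34-5.37 we can find covariance operators conditioned on history at time $t$:

$$
\begin{aligned}
\mathcal{C}_{\mathcal{X}^{O+},\mathcal{X}^{A+},O,A \mid h_t} &= \mathcal{W}_{\mathcal{X}^{O+},\mathcal{X}^{A+},O,A \mid \mathcal{X}^O,\mathcal{X}^A}\, \mathcal{C}_{\mathcal{X}^O,\mathcal{X}^A \mid h_t} && (5.49)\\
\mathcal{C}_{\mathcal{T}^A,\mathcal{T}^A \mid h_t} &= \mathcal{W}_{\mathcal{T}^A,\mathcal{T}^A \mid \mathcal{X}^O,\mathcal{X}^A}\, \mathcal{C}_{\mathcal{X}^O,\mathcal{X}^A \mid h_t} && (5.50)\\
\mathcal{C}_{O,O,A \mid h_t} &= \mathcal{W}_{O,O,A \mid \mathcal{X}^O,\mathcal{X}^A}\, \mathcal{C}_{\mathcal{X}^O,\mathcal{X}^A \mid h_t} && (5.51)\\
\mathcal{C}_{A,A \mid h_t} &= \mathcal{W}_{A,A \mid \mathcal{X}^O,\mathcal{X}^A}\, \mathcal{C}_{\mathcal{X}^O,\mathcal{X}^A \mid h_t} && (5.52)
\end{aligned}
$$

Finally, we update our state with action $a_t$ and observation $o_t$ by applying KBR twice:

$$
\begin{aligned}
\mathcal{C}_{\mathcal{X}^{O+},\mathcal{X}^{A+},O \mid h_t,a_t} &= \mathcal{C}_{(\mathcal{X}^{O+},\mathcal{X}^{A+},O)A \mid h_t}\, \mathcal{C}^{-1}_{A,A \mid h_t}\, \phi^{A}(a_t) && (5.53)\\
\mathcal{C}_{O,O \mid h_t,a_t} &= \mathcal{C}_{(O,O)A \mid h_t}\, \mathcal{C}^{-1}_{A,A \mid h_t}\, \phi^{A}(a_t) && (5.54)\\
\mathcal{C}_{\mathcal{X}^{O+},\mathcal{X}^{A+} \mid h_t,a_t,o_t} &= \mathcal{C}_{(\mathcal{X}^{O+},\mathcal{X}^{A+})O \mid h_t,a_t}\, \mathcal{C}^{-1}_{O,O \mid h_t,a_t}\, \phi^{O}(o_t) && (5.55)
\end{aligned}
$$

The PSR state can now be computed:

$$
\begin{aligned}
\mathcal{C}_{\mathcal{X}^O,\mathcal{X}^A \mid h_{t+1}} &= \mathcal{C}_{\mathcal{X}^{O+},\mathcal{X}^{A+} \mid h_t,a_t,o_t} && (5.56)\\
\mathcal{C}_{\mathcal{T}^A,\mathcal{T}^A \mid h_{t+1}} &= \mathcal{W}_{\mathcal{T}^A,\mathcal{T}^A \mid \mathcal{X}^O,\mathcal{X}^A}\, \mathcal{C}_{\mathcal{X}^O,\mathcal{X}^A \mid h_{t+1}} && (5.57)\\
\mathcal{W}_{\mathcal{T}^O \mid \mathcal{T}^A,h_{t+1}} &= \mathcal{U}\, \mathcal{C}_{\mathcal{X}^O,\mathcal{X}^A \mid h_{t+1}}\, \mathcal{C}^{-1}_{\mathcal{T}^A,\mathcal{T}^A \mid h_{t+1}} && (5.58)
\end{aligned}
$$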

5.3  Kernel  Learning  Algorithms  for  PSRs 


The equations in Section 5.2 allow us to estimate the appropriate covariance operators and update
the Hilbert space embedding of the PSR state. If the RKHS embeddings are finite-dimensional, then the
learning algorithm and state update are the same as in Chapter 4. However, if the RKHS is infinite-dimensional,
which is often the case, then it is not possible to store or manipulate the above covariances
directly. Instead, we use the “kernel trick”: we represent all of the covariances, and compute all state
updates, with Gram matrices.

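Given $T$ i.i.d. tuples $\{(\tau^O_t, \tau^A_t, o_t, a_t, h_t)\}_{t=1}^{T}$ from a PSR, we denote:

$$
\begin{aligned}
\Upsilon^{O} &= \left(\phi^{O}(o_1), \ldots, \phi^{O}(o_{T-1})\right)\\
\Upsilon^{A} &= \left(\phi^{A}(a_1), \ldots, \phi^{A}(a_{T-1})\right)\\
\Upsilon^{\mathcal{H}} &= \left(\phi^{\mathcal{H}}(h_1), \ldots, \phi^{\mathcal{H}}(h_{T-1})\right)
\end{aligned}
\qquad (5.59)
$$

The learning algorithm for HSE-PSRs proceeds to estimate covariance operators by Gram matrices.
The following Gram matrices are used below:

$$
\begin{aligned}
G_{\mathcal{H},\mathcal{H}} &= \Upsilon^{\mathcal{H}*}\Upsilon^{\mathcal{H}} && (5.60)\\
G_{A,A} &= \Upsilon^{A*}\Upsilon^{A} && (5.61)\\
G_{O,O} &= \Upsilon^{O*}\Upsilon^{O} && (5.62)\\
G_{\mathcal{T}^A,\mathcal{T}^A} &= \Upsilon^{\mathcal{T}^A*}\Upsilon^{\mathcal{T}^A} && (5.63)\\
G_{\mathcal{T}^O,\mathcal{T}^O} &= \Upsilon^{\mathcal{T}^O*}\Upsilon^{\mathcal{T}^O} && (5.64)\\
G_{\mathcal{T}^O\otimes\mathcal{T}^A,\,\mathcal{T}^O\otimes\mathcal{T}^A} &= G_{\mathcal{T}^O,\mathcal{T}^O} \circ G_{\mathcal{T}^A,\mathcal{T}^A} && (5.65)\\
G_{\mathcal{T}^O\otimes\mathcal{T}^A,\,\mathcal{T}^{O+}\otimes\mathcal{T}^{A+}} &= G_{\mathcal{T}^O,\mathcal{T}^{O+}} \circ G_{\mathcal{T}^A,\mathcal{T}^{A+}} && (5.66)
\end{aligned}
$$

where $\circ$ is the Hadamard (element-wise matrix) product.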
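The Hadamard-product identity in Equation 5.65 can be verified directly when the feature maps are finite-dimensional. The sketch below is illustrative only: `Phi_O` and `Phi_A` are hypothetical explicit feature matrices (one column per sample) standing in for the test feature maps, and the tensor-product features are formed as Kronecker products of the corresponding columns.

```python
import numpy as np

rng = np.random.default_rng(1)
T = 8
Phi_O = rng.normal(size=(3, T))   # explicit features of observation tests (assumed finite)
Phi_A = rng.normal(size=(2, T))   # explicit features of action tests (assumed finite)

G_O = Phi_O.T @ Phi_O             # Gram matrix of observation-test features
G_A = Phi_A.T @ Phi_A             # Gram matrix of action-test features

# Column t of Phi_OA is the Kronecker product of column t of Phi_O and Phi_A.
Phi_OA = np.einsum('it,jt->ijt', Phi_O, Phi_A).reshape(-1, T)
G_OA = Phi_OA.T @ Phi_OA          # Gram matrix of the tensor-product features

G_hadamard = G_O * G_A            # element-wise (Hadamard) product
print(np.allclose(G_OA, G_hadamard))
```

The practical consequence is that the Gram matrix of tensor-product features never requires forming the product features themselves, which matters when the individual feature spaces are infinite-dimensional.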
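5.3.1 The Gram Matrix Formulation for Hilbert Space Embeddings of PSRs

The PSR state is encoded as a set of weights $\alpha_{h_t}$ on samples. In order to recover the predictive
state, we need to compute the conditional covariance $\mathcal{W}_{\mathcal{T}^O \mid \mathcal{T}^A,h_t}$, which involves first computing
$\mathcal{C}_{\mathcal{T}^A,\mathcal{T}^A \mid h_t}$ and $\mathcal{C}_{\mathcal{T}^O,\mathcal{T}^A \mid h_t}$:

$$
\begin{aligned}
\mathcal{C}_{\mathcal{T}^A,\mathcal{T}^A \mid h_t} &= \Upsilon^{\mathcal{T}^A}\operatorname{diag}(\alpha_{h_t})\,\Upsilon^{\mathcal{T}^A*} && (5.67)\\
\mathcal{C}_{\mathcal{T}^O,\mathcal{T}^A \mid h_t} &= \Upsilon^{\mathcal{T}^O}\operatorname{diag}(\alpha_{h_t})\,\Upsilon^{\mathcal{T}^A*} && (5.68)
\end{aligned}
$$

Let $\Lambda_{h_t}$ be a diagonal matrix with the set of weights $\alpha_{h_t}$ along the diagonal: $\Lambda_{h_t} = \operatorname{diag}(\alpha_{h_t})$.
To write the predictive state at time $t$ in terms of Gram matrices we apply Equation 5.10:

$$
\mathcal{W}_{\mathcal{T}^O \mid \mathcal{T}^A,h_t} = \Upsilon^{\mathcal{T}^O}\Lambda_{h_t} G_{\mathcal{T}^A,\mathcal{T}^A}\left(\left(\Lambda_{h_t} G_{\mathcal{T}^A,\mathcal{T}^A}\right)^2 + \lambda T I\right)^{-1}\Lambda_{h_t}\Upsilon^{\mathcal{T}^A*} \qquad (5.69)
$$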


5.3.2  Gram  Matrix  State  Updates 

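Given the Gram matrix representation of state, we recursively apply kernel Bayes’ rule to update
state given a new action and observation. We start with a set of weights $\alpha_{h_t}$ on samples and the
corresponding diagonal matrix $\Lambda_{h_t}$. After choosing action $a_t$, we condition to find another set of
weights (again via Equation 5.10):

$$
\alpha_{h_t,a_t} = \Lambda_{h_t} G_{A,A}\left(\left(\Lambda_{h_t} G_{A,A}\right)^2 + \lambda T I\right)^{-1}\Lambda_{h_t}\Upsilon^{A*}\phi^{A}(a_t) \qquad (5.70)
$$

Equation 5.70 is the Gram matrix analogue of Equation 5.38. This vector can be used to find an
estimate of the conditional embedding $\mathcal{C}_{O \mid h_t,a_t}$ by weighting the samples $\Upsilon^{O}$:

$$
\mathcal{C}_{O \mid h_t,a_t} = \Upsilon^{O}\alpha_{h_t,a_t} \qquad (5.71)
$$

Given a new observation $o_t$, we apply KBR again, with $\Lambda_{h_t,a_t} = \operatorname{diag}(\alpha_{h_t,a_t})$:

$$
\alpha_{h_t,a_t,o_t} = \Lambda_{h_t,a_t} G_{O,O}\left(\left(\Lambda_{h_t,a_t} G_{O,O}\right)^2 + \lambda T I\right)^{-1}\Lambda_{h_t,a_t}\Upsilon^{O*}\phi^{O}(o_t) \qquad (5.72)
$$

Equation 5.72 is the Gram matrix analogue of Equation 5.40. Given the coefficients $\alpha_{h_t,a_t,o_t}$, we
can calculate the embedding of the conditional joint probability of future observation sequences
and future action sequences as

$$
\mathcal{C}_{\mathcal{T}^{O+},\mathcal{T}^{A+} \mid h_t,a_t,o_t} = \Upsilon^{\mathcal{T}^{O+}\otimes\mathcal{T}^{A+}}\alpha_{h_t,a_t,o_t} \qquad (5.73)
$$

To finish updating the state we map these coefficients on samples $\Upsilon^{\mathcal{T}^{O+}\otimes\mathcal{T}^{A+}}$ to coefficients on
samples $\Upsilon^{\mathcal{T}^O\otimes\mathcal{T}^A}$:

$$
\alpha_{h_{t+1}} = G_{\mathcal{H},\mathcal{H}}\left(G_{\mathcal{T}^O\otimes\mathcal{T}^A,\,\mathcal{T}^O\otimes\mathcal{T}^A}\, G_{\mathcal{H},\mathcal{H}} + \lambda T I\right)^{-1} G_{\mathcal{T}^O\otimes\mathcal{T}^A,\,\mathcal{T}^{O+}\otimes\mathcal{T}^{A+}}\,\alpha_{h_t,a_t,o_t} \qquad (5.74)
$$

Equation 5.74 can be viewed as the first step of our recursive filtering algorithm; that is, we
are taking the state at time $t+1$ and applying Equation 5.9 to begin the process of finding the
conditional embedding weights $\alpha_{h_{t+1},a_{t+1}}$ (cf. Equation 5.70). To continue updating the state, we
recursively apply Equations 5.70-5.74.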
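The two conditioning steps in Equations 5.70 and 5.72 share one formula, which can be sketched as a single helper. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the function names `rbf_gram` and `kbr_weights`, the Gaussian RBF kernel with unit bandwidth, the synthetic data, and the uniform initial weights are all hypothetical choices made for the example. The vector `k` plays the role of $\Upsilon^{A*}\phi^{A}(a_t)$ (or $\Upsilon^{O*}\phi^{O}(o_t)$), i.e. kernel evaluations between each training sample and the new action or observation.

```python
import numpy as np

def rbf_gram(X, Y, sigma):
    """Gaussian RBF kernel matrix between the rows of X and the rows of Y."""
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def kbr_weights(alpha, G, k, lam):
    """Weights Lambda G ((Lambda G)^2 + lam*T*I)^{-1} Lambda k, with Lambda = diag(alpha)."""
    T = len(alpha)
    LG = alpha[:, None] * G            # Lambda @ G without forming Lambda explicitly
    rhs = alpha * k                    # Lambda @ k
    return LG @ np.linalg.solve(LG @ LG + lam * T * np.eye(T), rhs)

rng = np.random.default_rng(0)
T = 30
actions = rng.normal(size=(T, 1))          # synthetic training actions
observations = rng.normal(size=(T, 2))     # synthetic training observations
G_AA = rbf_gram(actions, actions, 1.0)
G_OO = rbf_gram(observations, observations, 1.0)

alpha_h = np.full(T, 1.0 / T)              # assumed initial state weights
a_t = np.array([[0.3]])
o_t = np.array([[0.1, -0.2]])

# Condition on the chosen action, then on the received observation.
alpha_ha = kbr_weights(alpha_h, G_AA, rbf_gram(actions, a_t, 1.0)[:, 0], lam=1e-4)
alpha_hao = kbr_weights(alpha_ha, G_OO, rbf_gram(observations, o_t, 1.0)[:, 0], lam=1e-4)
```

Using `np.linalg.solve` instead of an explicit inverse is a numerical design choice of this sketch; the regularizer keeps the solved matrix well-posed because the eigenvalues of $(\Lambda G)^2$ are non-negative.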
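5.3.3 A Spectral Learning Algorithm

The kernel spectral algorithm for PSRs proceeds by first performing a ‘thin’ SVD of the sample
covariance $\mathcal{C}_{(\mathcal{T}^O,\mathcal{T}^A),\mathcal{H}} = \frac{1}{T}\Upsilon^{\mathcal{T}^O\otimes\mathcal{T}^A}\Upsilon^{\mathcal{H}*}$. Then the left singular vector $v = \Upsilon^{\mathcal{T}^O\otimes\mathcal{T}^A}\alpha$ ($\alpha \in \mathbb{R}^{T}$)
can be estimated as follows:

$$
\begin{aligned}
\mathcal{C}_{(\mathcal{T}^O,\mathcal{T}^A),\mathcal{H}}\,\mathcal{C}^{*}_{(\mathcal{T}^O,\mathcal{T}^A),\mathcal{H}}\, v &= \omega v\\
\tfrac{1}{T^2}\Upsilon^{\mathcal{T}^O\otimes\mathcal{T}^A} G_{\mathcal{H},\mathcal{H}}\Upsilon^{\mathcal{T}^O\otimes\mathcal{T}^A*}\, v &= \omega v\\
\tfrac{1}{T^2}\Upsilon^{\mathcal{T}^O\otimes\mathcal{T}^A} G_{\mathcal{H},\mathcal{H}}\, G_{\mathcal{T}^O\otimes\mathcal{T}^A,\,\mathcal{T}^O\otimes\mathcal{T}^A}\,\alpha &= \omega\,\Upsilon^{\mathcal{T}^O\otimes\mathcal{T}^A}\alpha\\
\tfrac{1}{T^2} G_{\mathcal{T}^O\otimes\mathcal{T}^A,\,\mathcal{T}^O\otimes\mathcal{T}^A}\, G_{\mathcal{H},\mathcal{H}}\, G_{\mathcal{T}^O\otimes\mathcal{T}^A,\,\mathcal{T}^O\otimes\mathcal{T}^A}\,\alpha &= \omega\, G_{\mathcal{T}^O\otimes\mathcal{T}^A,\,\mathcal{T}^O\otimes\mathcal{T}^A}\,\alpha, \quad (\alpha \in \mathbb{R}^{T},\ \omega \in \mathbb{R}) && (5.75)
\end{aligned}
$$

where $\alpha$ is the generalized eigenvector. After normalization, we have

$$
v = \frac{1}{\sqrt{\alpha^{\top} G_{\mathcal{T}^O\otimes\mathcal{T}^A,\,\mathcal{T}^O\otimes\mathcal{T}^A}\,\alpha}}\,\Upsilon^{\mathcal{T}^O\otimes\mathcal{T}^A}\alpha \qquad (5.76)
$$

Then the $\mathcal{U}$ operator is the column concatenation of the $d$ top left singular vectors, i.e. $\mathcal{U} = (v_1, \ldots, v_d)$.
If we let $A = (\alpha_1, \ldots, \alpha_d) \in \mathbb{R}^{T \times d}$ be the column concatenation of the $d$ top $\alpha_i$,
and $D = \operatorname{diag}\left((\alpha_1^{\top} G_{\mathcal{T}^O\otimes\mathcal{T}^A,\,\mathcal{T}^O\otimes\mathcal{T}^A}\,\alpha_1)^{-1/2}, \ldots, (\alpha_d^{\top} G_{\mathcal{T}^O\otimes\mathcal{T}^A,\,\mathcal{T}^O\otimes\mathcal{T}^A}\,\alpha_d)^{-1/2}\right) \in \mathbb{R}^{d \times d}$, we can
concisely express $\mathcal{U} = \Upsilon^{\mathcal{T}^O\otimes\mathcal{T}^A} A D$. Therefore, we can compute the low-dimensional embedding:

$$
\begin{aligned}
\mathcal{C}_{\mathcal{X}^O,\mathcal{X}^A \mid h_t} &= \mathcal{U}^{*}\mathcal{C}_{\mathcal{T}^O,\mathcal{T}^A \mid h_t}\\
&= \mathcal{U}^{*}\Upsilon^{\mathcal{T}^O\otimes\mathcal{T}^A}\alpha_{h_t}\\
&= D^{\top} A^{\top} G_{\mathcal{T}^O\otimes\mathcal{T}^A,\,\mathcal{T}^O\otimes\mathcal{T}^A}\,\alpha_{h_t}
\end{aligned}
\qquad (5.77)
$$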
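The relationship between the Gram-matrix eigenproblem and the left singular vectors can be checked in a setting where the features are explicit and finite-dimensional. The sketch below makes several assumptions that are not in the text: `Phi_T` and `Phi_H` are hypothetical finite feature matrices standing in for the test and history feature maps, the constant normalization of the eigenproblem is dropped (it only rescales the eigenvalues), and the coefficient vectors are obtained from the ordinary eigenproblem $G_{\mathcal{H},\mathcal{H}} G \alpha = \omega\alpha$, whose solutions also satisfy the generalized eigenproblem above.

```python
import numpy as np

rng = np.random.default_rng(2)
T, d = 40, 3
Phi_T = rng.normal(size=(6, T))   # explicit stand-in for the tensor-product test features
Phi_H = rng.normal(size=(5, T))   # explicit stand-in for the history features
G_T = Phi_T.T @ Phi_T
G_H = Phi_H.T @ Phi_H

# Any alpha with G_H G_T alpha = omega alpha also solves G_T G_H G_T alpha = omega G_T alpha.
evals, evecs = np.linalg.eig(G_H @ G_T)
order = np.argsort(-evals.real)[:d]            # indices of the d largest eigenvalues
A = evecs[:, order].real                       # T x d coefficient matrix

# D normalizes each singular vector: entries (alpha_i^T G_T alpha_i)^(-1/2).
D = np.diag(1.0 / np.sqrt(np.einsum('ti,ts,si->i', A, G_T, A)))
U = Phi_T @ A @ D                              # only computable because features are explicit here

# Reference: left singular vectors of the (unnormalized) sample covariance.
U_svd = np.linalg.svd(Phi_T @ Phi_H.T)[0][:, :d]

# Low-dimensional state from sample weights, using Gram matrices only.
alpha_h = np.full(T, 1.0 / T)
state_lowdim = D.T @ A.T @ G_T @ alpha_h
```

The columns of `U` match those of `U_svd` up to sign, so the comparison is best made through the projectors `U @ U.T` and `U_svd @ U_svd.T`. In the kernel setting only `A`, `D`, and the Gram matrices are ever stored.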


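Again, let $\Lambda_{h_t} = \operatorname{diag}(\alpha_{h_t})$. Updating state given a new action and observation proceeds in
exactly the same way as the non-minimal case:

$$
\begin{aligned}
\alpha_{h_t,a_t} &= \Lambda_{h_t} G_{A,A}\left(\left(\Lambda_{h_t} G_{A,A}\right)^2 + \lambda T I\right)^{-1}\Lambda_{h_t}\Upsilon^{A*}\phi^{A}(a_t)\\
\Lambda_{h_t,a_t} &= \operatorname{diag}(\alpha_{h_t,a_t})\\
\alpha_{h_t,a_t,o_t} &= \Lambda_{h_t,a_t} G_{O,O}\left(\left(\Lambda_{h_t,a_t} G_{O,O}\right)^2 + \lambda T I\right)^{-1}\Lambda_{h_t,a_t}\Upsilon^{O*}\phi^{O}(o_t)
\end{aligned}
$$

To finish updating the state, writing $G = G_{\mathcal{T}^O\otimes\mathcal{T}^A,\,\mathcal{T}^O\otimes\mathcal{T}^A}$ for brevity:

$$
\alpha_{h_{t+1}} = G_{\mathcal{H},\mathcal{H}}\, G\, A D\left(D^{\top} A^{\top} G\, G_{\mathcal{H},\mathcal{H}}\, G\, A D\right)^{-1} D^{\top} A^{\top} G_{\mathcal{T}^O\otimes\mathcal{T}^A,\,\mathcal{T}^{O+}\otimes\mathcal{T}^{A+}}\,\alpha_{h_t,a_t,o_t} \qquad (5.78)
$$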

Although  we  have  not  supplied  any  sample  complexity  results  for  this  algorithm  yet,  we  believe 
that  we  can  place  bounds  on  each  individual  empirically  estimated  covariance  operator,  and  then 
chain  those  bounds  together  to  get  a  bound  on  the  result:  a  similar  approach  was  used  to  obtain 
a  bound  on  a  previous  version  of  this  algorithm  [99]. 
Figure 5.1: Robot vision data. (A) Sample images from the robot’s camera. The figure below
depicts  the  hallway  environment  with  a  central  obstacle  (black)  and  the  path  that  the  robot  took 
through  the  environment  (the  red  counter-clockwise  ellipse).  (B)  Squared  error  for  prediction 
with  different  estimated  models  and  baselines. 


5.4  Experimental  Results 

For our experimental results, we implemented a slight variation on the algorithm described above
for a special subclass of HSE-PSRs called Hilbert Space Embeddings of Hidden Markov Models
(HSE-HMMs). HSE-HMMs are something of a misnomer: the model is a “hidden Markov
model” with a potentially infinite number of states, and is therefore equivalent to an Observable
Operator Model or an uncontrolled PSR. As such we can just use the learning algorithm described
above, but ignore actions. This makes the math and learning considerably easier in practice.

In the experimental results described here, we used an older variation of this algorithm with
some differences in the observation update [99]. In particular, Song et al. [99] update the embedding
of the belief state of the HMM given a new observation $o_t$ with an estimated operator
$B_{o_t}$, which is a conditional mean update in RKHS multiplied by a conditional density estimate
(which must be estimated separately). This is a less direct option than performing the update
completely in the embedding space. We hypothesize that using kernel Bayes’ rule, as described
in the previous section, is not just more direct, but should work better in practice. We will
directly compare the two approaches in future work.

We designed 3 sets of experiments to evaluate the effectiveness of learning embedded HMMs
for difficult real-world filtering and prediction tasks. In each case we compare the learned embedded
HMM to several alternative time series models including (I) linear dynamical systems
(LDS) learned by spectral methods (Chapter 2) with stability constraints (Chapter 7), (II) discrete
HMMs learned by EM, and (III) the Reduced-rank HMM (RR-HMM) learned by spectral
methods [90]. In these experiments we demonstrate that the kernel spectral learning algorithm
for embedded HMMs achieves state-of-the-art performance.

5.4.1  Robot  Vision 

In this experiment, a video of 2000 frames was collected at 6 Hz from a Point Grey Bumblebee2
stereo camera mounted on a Botrics Obot d100 mobile robot platform circling a stationary obstacle
(under imperfect human control) (Figure 5.1(A)), and 1500 frames were used as training
data for each model. Each frame from the training data was reduced to 100 dimensions via SVD
on single observations. The goal of this experiment was to learn a model of the noisy video, and,
after filtering, to predict future image observations.

Figure 5.2: Slot car inertial measurement data. (A) The slot car platform and the IMU (top)
and the racetrack (bottom). (B) Squared error for prediction with different estimated models and
baselines.

We trained a 50-dimensional embedded HMM with tests consisting of sequences of 20 consecutive
observations. Gaussian RBF kernels are used and the bandwidth parameter is set with
the median of squared distance between training points (median trick). The regularization parameter
$\lambda$ is set to $10^{-4}$. For comparison, a 50-dimensional RR-HMM with Parzen windows
is also learned with sequences of 20 observations [90]; a 50-dimensional LDS is learned using
Subspace ID with Hankel matrices of 20 time steps; and finally a 50-state discrete HMM with
axis-aligned Gaussian observation models is learned using the EM algorithm run until convergence.
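The median trick mentioned above can be sketched in a few lines. The exact way the median enters the kernel is a convention that the text does not spell out, so the parameterization here is an assumption: the kernel is taken to be $k(x, y) = \exp(-\lVert x - y\rVert^2 / (2s))$ with $s$ the median squared distance between distinct training points. The function names are hypothetical.

```python
import numpy as np

def pairwise_sq_dists(X):
    """Squared Euclidean distances between all rows of X."""
    return ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)

def median_sq_dist(X):
    """Median squared distance over distinct pairs of training points."""
    sq = pairwise_sq_dists(X)
    upper = np.triu_indices(len(X), k=1)   # each unordered pair once, no diagonal
    return np.median(sq[upper])

def rbf_gram(X, s):
    """Gaussian RBF Gram matrix exp(-||x - y||^2 / (2 s)); this scaling is an assumption."""
    return np.exp(-pairwise_sq_dists(X) / (2.0 * s))

X = np.array([[0.0], [1.0], [3.0]])        # toy training points
s = median_sq_dist(X)                      # squared distances are 1, 9, 4
G = rbf_gram(X, s)
```

Excluding the diagonal when taking the median avoids biasing the bandwidth toward zero; other variants (for example using the median distance rather than the median squared distance) differ only in the resulting scale.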

For each model, we performed filtering³ for different extents $t_1 = 100, 101, \ldots, 250$, then
predicted an image which was a further $t_2$ steps in the future, for $t_2 = 1, 2, \ldots, 100$. The squared
error of this prediction in pixel space was recorded, and averaged over all the different filtering
extents $t_1$ to obtain means which are plotted in Figure 5.1(B). As baselines, we also plot the error
obtained by using the mean of filtered data as a predictor (Mean), and the error obtained by using
the last filtered observation (Last).

All of the more complex algorithms perform better than the baselines (though as expected,
the ‘Last’ predictor is a good one-step predictor), indicating that this is a nontrivial prediction
problem. The embedded HMM learned by the kernel spectral algorithm yields significantly lower
prediction error compared to each of the alternatives (including the RR-HMM) consistently for
the duration of the prediction horizon (100 timesteps, i.e. 16 seconds).

5.4.2  Slot  Car  Inertial  Measurement 

In a second experiment, the setup consisted of a track and a miniature car (1:32 scale model)
guided by a slot cut into the track. Figure 5.2(A) shows the car and the attached IMU (an
Intel Inertiadot) in the upper panel, and the 14 m track, which contains elevation changes and
banked curves, in the lower panel. At each time step we extracted the estimated 3-D acceleration of the car and the
estimated change in the 3-D orientation of the car since the previous time step, at a rate
of 10 Hz. We collected 3000 successive measurements of this data while the slot car circled the
track controlled by a constant policy. The goal was to learn a model of the noisy IMU data, and,
after filtering, to predict future readings.

³ Update models online with incoming observations.

We trained a 20-dimensional embedded HMM with tests consisting of sequences of 150 consecutive
observations. The bandwidth parameter of the Gaussian RBF kernels is set with the ‘median
trick’. The regularization parameter $\lambda$ is $10^{-4}$. For comparison, a 20-dimensional RR-HMM with
Parzen windows is also learned with sequences of 150 observations; a 20-dimensional LDS is
learned using Subspace ID with Hankel matrices of 150 time steps; and finally, a 20-state discrete
HMM (with 400 levels of discretization for observations) is learned using the EM algorithm.

For each model, we performed filtering for different extents $t_1 = 100, 101, \ldots, 250$, then
predicted a reading which was a further $t_2$ steps in the future, for $t_2 = 1, 2, \ldots, 100$. The squared
error of this prediction in the IMU’s measurement space was recorded, and averaged over all
the different filtering extents $t_1$ to obtain means which are plotted in Figure 5.2(B). Again the
embedded HMM yields lower prediction error compared to each of the alternatives consistently
for the duration of the prediction horizon.

5.4.3  Audio  Event  Classification 

Our final experiment concerns an audio classification task. The data, recently presented in [81],
consisted of sequences of 13-dimensional Mel-Frequency Cepstral Coefficients (MFCC) obtained
from short clips of raw audio data recorded using a portable sensor device. Six classes
of labeled audio clips were present in the data, one being Human speech. For this experiment
we grouped the other five classes into a single class of Non-human sounds to formulate a binary
Human vs. Non-human classification task. Since the original data had a disproportionately large
amount of Human speech samples, this grouping resulted in a more balanced dataset with 40
minutes 11 seconds of Human and 28 minutes 43 seconds of Non-human audio data. To reduce
noise and training time we averaged the data every 100 timesteps (equivalent to 1 second).

For each of the two classes, we trained embedded HMMs with 10, 20, . . . , 50 latent dimensions
using spectral learning and Gaussian RBF kernels with bandwidth set with the ‘median
trick’. The regularization parameter $\lambda$ is $10^{-1}$. For comparison, regular HMMs with axis-aligned
Gaussian observation models, LDSs and RR-HMMs were trained using multi-restart EM (to
avoid local minima), stable Subspace ID and the spectral algorithm of [90] respectively, also
with 10, . . . , 50 latent dimensions.

For RR-HMMs, regular HMMs and LDSs, the class-conditional data sequence likelihood
is the scoring function for classification. For embedded HMMs, the scoring function for a
test sequence $x_{1:T}$ is the log of the product of the compatibility scores for each observation,
i.e. $\sum_{\tau=1}^{T} \log\left(\left\langle \phi(x_\tau),\, \hat{\mu}_{X_\tau \mid x_{1:\tau-1}} \right\rangle_{\mathcal{F}}\right)$.

For each model size, we performed 50 random 2:1 partitions of data from each class and used
the resulting datasets for training and testing respectively. The mean accuracy and 95% confidence
intervals over these 50 randomizations are reported in Figure 5.3. The graph indicates that
embedded HMMs have higher accuracy and lower variance than other standard alternatives at
every model size. Though other learning algorithms for HMMs and LDSs exist, our experiment
shows this to be a non-trivial sequence classification problem where embedded HMMs significantly
outperform commonly used sequential models trained using typical learning and model
selection methods.

Figure 5.3: Accuracies and 95% confidence intervals for Human vs. Non-human audio event
classification, comparing embedded HMMs to other common sequential models at different latent
state space sizes.


5.5  Conclusion 

In this chapter we proposed an extension of the feature-based observable representation of PSRs
presented in Chapter 4. We extended the finite features of that chapter to Hilbert spaces, resulting
in Hilbert space embeddings of PSRs. The essence of this new approach is to represent distributions
over tests as elements in Hilbert spaces, and update these elements entirely in the Hilbert
spaces using kernel Bayes’ rule. This allows us to derive a local-minimum-free kernel spectral
algorithm for learning the embedded PSRs. In our experimental results we show that a variation
of this algorithm [99] exceeds the previous state of the art on challenging real-world problems.

We  briefly  note  that  it  is  possible  to  extend  the  algorithms  presented  in  this  chapter  to  a 
manifold  dynamical  system  learning  algorithm.  If  we  are  interested  in  modeling  a  dynamical 
system  whose  state  space  lies  on  a  low-dimensional  manifold,  then  we  can  use  this  additional 
knowledge  to  constrain  the  learning  algorithm  and  produce  a  more  accurate  model  for  a  given 
amount  of  training  data.  For  details  see  [13]. 
Chapter 6


Computational  Efficiency  in  Spectral 
Learning  Algorithms 


6.1  Introduction 


In the previous chapters, we have described several novel spectral learning algorithms that can be
used to learn models of partially observable nonlinear dynamical systems such as HMMs [42, 90]
and PSRs [12, 14, 84]. These algorithms are statistically consistent, unlike the popular expectation
maximization (EM) algorithm, which is subject to local optima. Furthermore, we have seen
that spectral learning algorithms are easy to implement with a series of linear algebra operations.
Despite these attractive features, these algorithms have so far had an important drawback: they
are batch methods (needing to store their entire training data set in memory at once) instead of
online ones (with space complexity independent of the number of training examples and time
complexity linear in the number of training examples).

To remedy this drawback, we propose a fast, online spectral algorithm for PSRs. Since PSRs
subsume HMMs and POMDPs [84, 93], the algorithm described in this chapter also improves on
past algorithms for these other models. Our method leverages fast, low-rank modifications of the
thin singular value decomposition [20], and uses tricks such as random projections to scale to
extremely large numbers of examples and features per example. Consequently, the new method
can handle orders of magnitude larger data sets than previous methods, and can therefore scale
to learn models of systems that are too complex for previous methods.

Experiments  show  that  our  online  spectral  learning  algorithm  does  a  good  job  recovering 
the  parameters  of  a  nonlinear  dynamical  system  in  two  partially  observable  domains.  In  our  first 
experiment  we  empirically  demonstrate  that  our  online  spectral  learning  algorithm  is  unbiased  by 
recovering  the  parameters  of  a  small  but  difficult  synthetic  Reduced-Rank  HMM.  In  our  second 
experiment  we  demonstrate  the  performance  of  the  new  method  on  a  difficult,  high-bandwidth 
video  understanding  task. 
6.2 Batch Learning of PSRs


In Chapter 4, we presented a straightforward learning algorithm for PSRs: we build empirical
estimates of observable features $\Sigma_{AO,\mathcal{H},AO}$, $\Sigma_{\mathcal{T},\mathcal{H}}$, and $\Sigma_{\mathcal{T}^+,AO,\mathcal{H}}$ and compute $U$ as the matrix
of $d$ leading left singular vectors of $\Sigma_{\mathcal{T},\mathcal{H}}$. Finally, we use the estimated covariances and $U$ to
compute estimated PSR parameters. One of the advantages of subspace identification is that the
complexity of the model can be tuned by selecting the number of singular vectors in $U$, at the
risk of losing prediction quality.

As we include more data in our averages, the law of large numbers guarantees that our estimates
converge to the true matrices $\Sigma_{AO,\mathcal{H},AO}$, $\Sigma_{\mathcal{T},\mathcal{H}}$, and $\Sigma_{\mathcal{T}^+,AO,\mathcal{H}}$. So by continuity of the
formulas in Chapter 4, if our system is truly a PSR of finite rank, our estimated parameters converge,
with probability 1, to the true parameters up to a linear transform; that is, our learning
algorithm is consistent.¹

Unfortunately, it is difficult to implement the naive algorithm in practice. For example, if
there are a very large number of features of tests or features of histories, we may not be able even
to store the full parameters $\Sigma_{AO,\mathcal{H},AO}$, $\Sigma_{\mathcal{T},\mathcal{H}}$, and $\Sigma_{\mathcal{T}^+,AO,\mathcal{H}}$ in memory. Therefore, we want to
use a more efficient algorithm, one that does not explicitly build these parameters. We will start
by improving the batch algorithm, then make it online in Section 6.3.


6.2.1  An  Efficient  Batch  Learning  Algorithm 

The key idea is to compute a set of smaller-sized intermediate quantities from realizations of
characteristic features $\phi^{\mathcal{T}}$, indicative features $\phi^{\mathcal{H}}$, and observation features $\phi^{AO}$, and then compute
PSR parameters from these quantities.

In the batch setting we can store $\Sigma_{\mathcal{T},\mathcal{H}}$ and compute its rank-$d$ singular value decomposition
$U, S, V^{\top}$. Then, instead of computing $\Sigma_{AO,\mathcal{H},AO}$ and $\Sigma_{\mathcal{T}^+,AO,\mathcal{H}}$ directly, we use the factors
$U, S, V^{\top}$ to make storing the other matrices and calculating the ultimate PSR parameters much
more efficient. (When we discuss iterative updating in Section 6.3 below we don’t even have to
store $\Sigma_{\mathcal{T},\mathcal{H}}$ or compute its SVD directly, potentially increasing our computational and memory
savings by a substantial amount.)

In more detail, we begin by estimating $\Sigma_{\mathcal{T},\mathcal{H}}$. Recall that each element of our estimate $\Sigma_{\mathcal{T},\mathcal{H}}$
is an unnormalized empirical expectation of the product of one indicative feature and one characteristic
feature, if we sample a history from $\omega$ and then follow an appropriate sequence of actions.
We can compute all elements of $\Sigma_{\mathcal{T},\mathcal{H}}$ from a single sample of trajectories if we sample histories
from $\omega$, follow an appropriate exploratory policy, and then importance-weight each sample [16]:
$[\Sigma_{\mathcal{T},\mathcal{H}}]_{ij}$ is $\sum_{t=1}^{w} \lambda_t\, \phi^{\mathcal{T}}_{it}\, \phi^{\mathcal{H}}_{jt}$, where $\lambda_t$ is an importance weight.

Next we compute $U S V^{\top}$, the rank-$d$ thin singular value decomposition of $\Sigma_{\mathcal{T},\mathcal{H}}$:

$$
(U, S, V) = \operatorname{SVD}\left(\Sigma_{\mathcal{T},\mathcal{H}}\right) \qquad (6.1)
$$

¹ Continuity holds if we fix $U$; a similar but more involved argument works if we estimate $U$ as well.

The left singular vectors $U$ define the state space of the PSR and therefore play a direct role in
the PSR learning algorithm. However, the right singular vectors $V$ and singular values $S$ can
also be used to make computation of the other PSR parameters more efficient.

First, and most obviously, we can directly compute the estimate of an initial feasible state $b_*$
from Equation 4.2a using $S$ and $V$. The key is noting that $U^{\top}\Sigma_{\mathcal{T},\mathcal{H}} = S V^{\top}$. Then

$$
b_* = S V^{\top} e \qquad (6.2a)
$$
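The identity $U^{\top}\Sigma_{\mathcal{T},\mathcal{H}} = S V^{\top}$ holds for the rank-$d$ truncated factors of any matrix, which a short numerical check makes concrete. In this sketch the matrix `Sigma` is random and purely illustrative, and the vector `e` is a placeholder: its actual definition comes from Equation 4.2a in Chapter 4, which is not reproduced here, so the all-ones choice below is an assumption.

```python
import numpy as np

rng = np.random.default_rng(3)
Sigma = rng.normal(size=(50, 30))      # illustrative stand-in for Sigma_{T,H}
d = 5

U_full, s_full, Vt_full = np.linalg.svd(Sigma, full_matrices=False)
U = U_full[:, :d]                      # d leading left singular vectors
S = np.diag(s_full[:d])                # d leading singular values
Vt = Vt_full[:d]                       # d leading right singular vectors (transposed)

lhs = U.T @ Sigma                      # requires the full matrix Sigma
rhs = S @ Vt                           # uses only the small factors

e = np.ones(Sigma.shape[1])            # placeholder for the vector e of Equation 6.2a
b_star = S @ Vt @ e                    # initial state computed without touching Sigma
```

The point of the rewrite is that `b_star` needs only the $d \times d$ matrix `S` and the $d$ rows of `Vt`, not the full feature-by-feature matrix.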

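More importantly, we can compute the Bayes’ rule state update without computing the full tensors
$\Sigma_{AO,\mathcal{H},AO}$ and $\Sigma_{\mathcal{T}^+,AO,\mathcal{H}}$. To see that, recall that the Bayes’ rule state update is given by
(Equation 4.22):

$$
b_{t+1} = U^{\top}\Sigma_{\mathcal{T}^+,AO}\left((U^{\top}\Sigma_{\mathcal{T},\mathcal{H}})^{\dagger} b_t\right)\left(\Sigma_{AO,AO}\left((U^{\top}\Sigma_{\mathcal{T},\mathcal{H}})^{\dagger} b_t\right)\right)^{-1}\phi^{AO}_t
$$

Therefore, instead of computing $\Sigma_{\mathcal{T}^+,AO,\mathcal{H}} = \sum_{t=1}^{w} \lambda_t\, \phi^{\mathcal{T}^+}_t \otimes \phi^{AO}_t \otimes \phi^{\mathcal{H}}_t$ we can instead compute
the much smaller tensor $\Sigma_{B^+,AO \mid B} = U^{\top}\Sigma_{\mathcal{T}^+,AO,\mathcal{H}}(U^{\top}\Sigma_{\mathcal{T},\mathcal{H}})^{\dagger}$ directly:

$$
\Sigma_{B^+,AO \mid B} = \sum_{t=1}^{w} \lambda_t \left(U^{\top}\phi^{\mathcal{T}^+}_t\right) \otimes \phi^{AO}_t \otimes \left(S^{-1}V^{\top}\phi^{\mathcal{H}}_t\right) \qquad (6.2b)
$$

Similarly, instead of estimating $\Sigma_{AO,\mathcal{H},AO} = \sum_{t=1}^{w} \lambda_t\, \phi^{AO}_t \otimes \phi^{\mathcal{H}}_t \otimes \phi^{AO}_t$ we compute the smaller
tensor

$$
\Sigma_{AO,AO \mid B} = \sum_{t=1}^{w} \lambda_t\, \phi^{AO}_t \otimes \left(S^{-1}V^{\top}\phi^{\mathcal{H}}_t\right) \otimes \phi^{AO}_t \qquad (6.2c)
$$

Finally, we can compute Bayes’ rule as

$$
\begin{aligned}
b_{t+1} &= U^{\top}\Sigma_{\mathcal{T}^+,AO}\left((U^{\top}\Sigma_{\mathcal{T},\mathcal{H}})^{\dagger} b_t\right)\left(\Sigma_{AO,AO}\left((U^{\top}\Sigma_{\mathcal{T},\mathcal{H}})^{\dagger} b_t\right)\right)^{-1}\phi^{AO}_t\\
&= U^{\top}\Sigma_{\mathcal{T}^+,AO}\left((U^{\top}\Sigma_{\mathcal{T},\mathcal{H}})^{\dagger} b_t\right)\left(\Sigma_{AO,AO \mid B}(b_t)\right)^{-1}\phi^{AO}_t\\
&= \Sigma_{B^+,AO \mid B}(b_t)\left(\Sigma_{AO,AO \mid B}(b_t)\right)^{-1}\phi^{AO}_t && (6.2d)
\end{aligned}
$$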

In summary, this learning algorithm works well when the number of features of tests, histories,
and action-observation pairs is relatively small, and in cases where data is collected in
batch. These restrictions can be limiting for many real-world data sets. In practice, the number
of features may need to be quite large in order to accurately estimate the parameters of the
PSR. Additionally, we are often interested in estimating PSRs from massive datasets, updating
PSR parameters given a new batch of data, or learning PSRs online from a data stream. Below
we develop several computationally efficient extensions to overcome these practical obstacles to
learning in real-world situations.

6.3  Iterative  Updates  to  PSR  Parameters 

We first attack the problem of updating existing PSR parameters given a batch of new information.
Next, we look at the special case of updating PSR parameters in an online setting (batch
size 1), and develop additional optimizations for this situation.
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6.3.1  Batch  Updates 

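We first present an algorithm for updating existing PSR parameters given a new batch of characteristic
features $\phi^{\mathcal{T}}_{\mathrm{new}}$, one-step removed characteristic features $\phi^{\mathcal{T}^+}_{\mathrm{new}}$, indicative features $\phi^{\mathcal{H}}_{\mathrm{new}}$,
and observation features $\phi^{\mathcal{AO}}_{\mathrm{new}}$. Naively, we can just store empirical estimates and update them
from each new batch of data: $\hat\Sigma_{\mathcal{T},\mathcal{H}} + \phi^{\mathcal{T}}_{\mathrm{new}} \phi^{\mathcal{H}\top}_{\mathrm{new}}$, $\hat\Sigma_{\mathcal{AO},\mathcal{H},\mathcal{AO}} + \sum_{t} \phi^{\mathcal{AO}}_{\mathrm{new},t} \otimes \phi^{\mathcal{H}}_{\mathrm{new},t} \otimes \phi^{\mathcal{AO}}_{\mathrm{new},t}$, and
$\hat\Sigma_{\mathcal{T}^+,\mathcal{AO},\mathcal{H}} + \sum_{t} \phi^{\mathcal{T}^+}_{\mathrm{new},t} \otimes \phi^{\mathcal{AO}}_{\mathrm{new},t} \otimes \phi^{\mathcal{H}}_{\mathrm{new},t}$. Then, after each batch, we can learn new PSR parameters.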

This naive algorithm is very inefficient: it requires storing $\hat\Sigma_{\mathcal{AO},\mathcal{H},\mathcal{AO}}$ and $\hat\Sigma_{\mathcal{T}^+,\mathcal{AO},\mathcal{H}}$,
updating these tensors given new information, and recomputing the PSR parameters. However,
as we have seen, it is also possible to write the PSR parameters in terms of a set of lower-dimensional,
memory-efficient matrices and tensors (Equations 6.2a-d), made possible by the
singular value decomposition of $\hat\Sigma_{\mathcal{T},\mathcal{H}}$. The key idea is to update these lower-dimensional matrices
directly, instead of the naive updates suggested above, by taking advantage of numerical
algorithms for updating singular value decompositions efficiently [20].

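The main computational savings come from using incremental SVD to update $\hat U$, $\hat S$, $\hat V$, $\hat\Sigma_{\mathcal{B}^+,\mathcal{AO}|\mathcal{B}}$,
and $\hat\Sigma_{\mathcal{AO},\mathcal{AO}|\mathcal{B}}$ directly. The incremental update for $\hat U$, $\hat S$, $\hat V$ is much more efficient than the naive
additive update when the number of new data points is much smaller than the number of features
in $\phi^{\mathcal{T}}$ and $\phi^{\mathcal{H}}$. The incremental updates for $\hat\Sigma_{\mathcal{B}^+,\mathcal{AO}|\mathcal{B}}$ and $\hat\Sigma_{\mathcal{AO},\mathcal{AO}|\mathcal{B}}$ save time and space when
the latent dimension $d$ is much smaller than the number of features in $\phi^{\mathcal{T}}$ and $\phi^{\mathcal{H}}$.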

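Our goal is therefore to compute the updated SVD,

$$\langle \hat U_{\mathrm{new}}, \hat S_{\mathrm{new}}, \hat V_{\mathrm{new}} \rangle = \mathrm{SVD}\left( \hat\Sigma_{\mathcal{T},\mathcal{H}} + \phi^{\mathcal{T}}_{\mathrm{new}} \Lambda\, \phi^{\mathcal{H}\top}_{\mathrm{new}} \right),$$

where $\Lambda$ is a diagonal matrix of importance weights $\Lambda = \mathrm{diag}(\lambda_{1:t})$. We will derive the incremental
SVD update in two steps. First, if the new data $\phi^{\mathcal{T}}_{\mathrm{new}}$ and $\phi^{\mathcal{H}}_{\mathrm{new}}$ lie entirely within the
column spaces of $\hat U$ and $\hat V$ respectively, then we can find $\hat S_{\mathrm{new}}$ by projecting both the new and
old data onto the subspaces defined by $\hat U$ and $\hat V$, and diagonalizing the resulting small ($n \times n$)
matrix:

$$\begin{aligned}
\langle \mathcal{U}, \hat S_{\mathrm{new}}, \mathcal{V} \rangle &= \mathrm{SVD}\left( \hat U^\top \left( \hat\Sigma_{\mathcal{T},\mathcal{H}} + \phi^{\mathcal{T}}_{\mathrm{new}} \Lambda\, \phi^{\mathcal{H}\top}_{\mathrm{new}} \right) \hat V \right) \\
&= \mathrm{SVD}\left( \hat S + (\hat U^\top \phi^{\mathcal{T}}_{\mathrm{new}})\, \Lambda\, (\hat V^\top \phi^{\mathcal{H}}_{\mathrm{new}})^\top \right)
\end{aligned}$$

We can then compute $\hat U_{\mathrm{new}} = \hat U \mathcal{U}$ and $\hat V_{\mathrm{new}} = \hat V \mathcal{V}$, the rotations of $\hat U$ and $\hat V$ induced by the
new data.

If the new data does not lie entirely within the column space of $\hat U$ and $\hat V$, we can update the
SVD efficiently (and optionally approximately) following Brand [20]. The idea is to split the new
data into a part within the column span of $\hat U$ and $\hat V$ and a remainder, and use this decomposition
to construct a small matrix to diagonalize as above.

Let $C$ and $D$ be orthonormal bases for the component of the column space of $\phi^{\mathcal{T}}_{\mathrm{new}}$ orthogonal
to $\hat U$ and the component of the column space of $\phi^{\mathcal{H}}_{\mathrm{new}}$ orthogonal to $\hat V$:

$$C = \mathrm{orth}\left( (I - \hat U \hat U^\top)\, \phi^{\mathcal{T}}_{\mathrm{new}} \right) \qquad (6.3a)$$

$$D = \mathrm{orth}\left( (I - \hat V \hat V^\top)\, \phi^{\mathcal{H}}_{\mathrm{new}} \right) \qquad (6.3b)$$

The dimension of $C$ and $D$ is upper-bounded by the number of data points in our new batch, or
the number of features of tests and histories, whichever is smaller. (If the dimension is large,
the orthogonalization step above (as well as other steps below) may be too expensive; we can
accommodate this case by splitting a large batch of examples into several smaller batches.) Let

$$K = \begin{bmatrix} \hat S & 0 \\ 0 & 0 \end{bmatrix} + \begin{bmatrix} \hat U^\top \\ C^\top \end{bmatrix} \phi^{\mathcal{T}}_{\mathrm{new}} \Lambda\, \phi^{\mathcal{H}\top}_{\mathrm{new}} \begin{bmatrix} \hat V & D \end{bmatrix} \qquad (6.3c)$$

and diagonalize $K$ to get the update to $\hat S$:

$$\langle \mathcal{U}, \hat S_{\mathrm{new}}, \mathcal{V} \rangle = \mathrm{SVD}(K) \qquad (6.3d)$$

Finally, as above, $\mathcal{U}$ and $\mathcal{V}$ rotate the extended subspaces $[\hat U \; C]$ and $[\hat V \; D]$:

$$\hat U_{\mathrm{new}} = [\hat U \; C]\, \mathcal{U} \qquad (6.3e)$$

$$\hat V_{\mathrm{new}} = [\hat V \; D]\, \mathcal{V} \qquad (6.3f)$$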
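The batch update in Equations 6.3a-f can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the function names `residual_basis` and `incremental_svd` are hypothetical, the singular values are stored as a vector, the new features are stored one data point per column, and `orth` is realized with a thin SVD of the residual and an absolute tolerance.

```python
import numpy as np

def residual_basis(basis, data, tol=1e-10):
    """Orthonormal basis for the part of span(data) orthogonal to span(basis)."""
    resid = data - basis @ (basis.T @ data)            # (I - B B^T) data
    u, s, _ = np.linalg.svd(resid, full_matrices=False)
    return u[:, s > tol]                               # drop numerically null directions

def incremental_svd(U, S, V, phi_T, phi_H, lam=None, rank=None):
    """Update a thin SVD U diag(S) V^T after adding phi_T diag(lam) phi_H^T.

    U: (p, n), S: (n,), V: (q, n); phi_T: (p, k), phi_H: (q, k); lam: (k,).
    """
    k = phi_T.shape[1]
    lam = np.ones(k) if lam is None else np.asarray(lam, dtype=float)
    C = residual_basis(U, phi_T)                       # Eq. 6.3a
    D = residual_basis(V, phi_H)                       # Eq. 6.3b
    n = S.shape[0]
    K = np.zeros((n + C.shape[1], n + D.shape[1]))
    K[:n, :n] = np.diag(S)
    left = np.vstack([U.T @ phi_T, C.T @ phi_T])       # [U C]^T phi_T
    right = np.vstack([V.T @ phi_H, D.T @ phi_H])      # [V D]^T phi_H
    K += (left * lam) @ right.T                        # Eq. 6.3c
    Uk, S_new, Vkt = np.linalg.svd(K, full_matrices=False)   # Eq. 6.3d
    U_new = np.hstack([U, C]) @ Uk                     # Eq. 6.3e
    V_new = np.hstack([V, D]) @ Vkt.T                  # Eq. 6.3f
    if rank is not None:                               # optional truncation
        U_new, S_new, V_new = U_new[:, :rank], S_new[:rank], V_new[:, :rank]
    return U_new, S_new, V_new
```

When the stored SVD represents the old matrix exactly and no truncation is applied, the result reproduces the updated matrix exactly; passing `rank` trades that exactness for a fixed model size.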


Note that if there are components orthogonal to $\hat U$ and $\hat V$ in the new data (i.e., if $C$ and $D$ are
nonempty),  the  size  of  the  thin  SVD  will  grow.  So,  during  this  step,  we  may  choose  to  tune  the 
complexity  of  our  estimated  model  by  restricting  the  dimensionality  of  the  SVD.  If  we  do  so, 
we  may  lose  information  compared  to  a  batch  SVD:  if  future  data  causes  our  estimated  leading 
singular  vectors  to  change,  the  dropped  singular  vectors  may  become  relevant  again.  However, 
empirically,  this  information  loss  can  be  minimal,  especially  if  we  keep  extra  “buffer”  singular 
vectors  beyond  what  we  expect  to  need. 

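Also note that the above updates do not necessarily preserve orthonormality of $\hat U_{\mathrm{new}}$ and
$\hat V_{\mathrm{new}}$, due to discarded nonzero singular values and the accumulation of numerical errors. To
correct for this, every few hundred iterations, we re-orthonormalize using a QR decomposition
and an SVD:

$$\begin{aligned}
\langle U_Q, U_R \rangle &= \mathrm{QR}(\hat U_{\mathrm{new}}) \\
\langle V_Q, V_R \rangle &= \mathrm{QR}(\hat V_{\mathrm{new}}) \\
\langle U_{QR}, S_{QR}, V_{QR} \rangle &= \mathrm{SVD}\left( U_R\, \hat S_{\mathrm{new}}\, V_R^\top \right) \\
\hat U_{\mathrm{new}} &= U_Q U_{QR} \\
\hat V_{\mathrm{new}} &= V_Q V_{QR} \\
\hat S_{\mathrm{new}} &= S_{QR}
\end{aligned}$$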
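The re-orthonormalization step can be written directly with NumPy's QR and SVD routines. The sketch below assumes the singular values are kept as a vector; the function name `reorthonormalize` is hypothetical.

```python
import numpy as np

def reorthonormalize(U, S, V):
    """Restore orthonormal factors without changing the product U diag(S) V^T."""
    Uq, Ur = np.linalg.qr(U)                  # U = Uq Ur, Uq has orthonormal columns
    Vq, Vr = np.linalg.qr(V)                  # V = Vq Vr
    Uqr, Sqr, Vqr_t = np.linalg.svd(Ur @ np.diag(S) @ Vr.T)
    return Uq @ Uqr, Sqr, Vq @ Vqr_t.T
```

Because the small matrix passed to the SVD is only n x n, the cost is dominated by the two QR factorizations of the tall factors.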


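The updated SVD now gives us enough information to compute the updates to $\hat\Sigma_{\mathcal{B}^+,\mathcal{AO}|\mathcal{B}}$ and
$\hat\Sigma_{\mathcal{AO},\mathcal{AO}|\mathcal{B}}$. Let $\Lambda$ be the diagonal tensor of importance weights $\Lambda_{i,i,i} = \lambda_i$ ($i \in 1, 2, \ldots, t$). Using
the newly computed subspaces $\hat U_{\mathrm{new}}$, $\hat S_{\mathrm{new}}$, and $\hat V_{\mathrm{new}}$, we can compute additive updates from the
new data. First, we compute the updated tensor $\hat\Sigma_{\mathcal{B}^+,\mathcal{AO}|\mathcal{B}_{\mathrm{new}}}$:

$$\begin{aligned}
\hat\Sigma_{\mathcal{B}^+,\mathcal{AO}|\mathcal{B}_{\mathrm{new}}} &= \left( \hat\Sigma_{\mathcal{T}^+,\mathcal{AO},\mathcal{H}} + \Lambda \times_1 \phi^{\mathcal{T}^+}_{\mathrm{new}} \times_2 \phi^{\mathcal{AO}}_{\mathrm{new}} \times_3 \phi^{\mathcal{H}}_{\mathrm{new}} \right) \times_1 \hat U^\top_{\mathrm{new}} \times_3 \hat S^{-1}_{\mathrm{new}} \hat V^\top_{\mathrm{new}} \\
&= \hat\Sigma^{\mathcal{B}^+,\mathcal{AO}}_{\mathrm{update}} + \hat\Sigma^{\mathcal{B}^+,\mathcal{AO}}_{\mathrm{new}}
\end{aligned} \qquad (6.4)$$

where $\hat\Sigma^{\mathcal{B}^+,\mathcal{AO}}_{\mathrm{update}}$ can be viewed as the projection of $\hat\Sigma_{\mathcal{B}^+,\mathcal{AO}|\mathcal{B}}$ onto the new subspace:

$$\begin{aligned}
\hat\Sigma^{\mathcal{B}^+,\mathcal{AO}}_{\mathrm{update}} &= \hat\Sigma_{\mathcal{T}^+,\mathcal{AO},\mathcal{H}} \times_1 \hat U^\top_{\mathrm{new}} \times_3 \hat S^{-1}_{\mathrm{new}} \hat V^\top_{\mathrm{new}} \\
&\approx \hat\Sigma_{\mathcal{B}^+,\mathcal{AO}|\mathcal{B}} \times_1 \left( \hat U^\top_{\mathrm{new}} \hat U \right) \times_3 \left( \hat S^{-1}_{\mathrm{new}} \hat V^\top_{\mathrm{new}} \hat V \hat S \right)
\end{aligned}$$

and $\hat\Sigma^{\mathcal{B}^+,\mathcal{AO}}_{\mathrm{new}}$ is the projection of the additive update onto the new subspace:

$$\hat\Sigma^{\mathcal{B}^+,\mathcal{AO}}_{\mathrm{new}} = \Lambda \times_1 \left( \hat U^\top_{\mathrm{new}} \phi^{\mathcal{T}^+}_{\mathrm{new}} \right) \times_2 \left( \phi^{\mathcal{AO}}_{\mathrm{new}} \right) \times_3 \left( \hat S^{-1}_{\mathrm{new}} \hat V^\top_{\mathrm{new}} \phi^{\mathcal{H}}_{\mathrm{new}} \right)$$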
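The additive term, a weighted sum of outer products of projected features, maps naturally onto `numpy.einsum`. The sketch below is illustrative only: the function name `additive_update` and the argument names are hypothetical, features are assumed to be stored one data point per column, and the singular values are assumed to be a strictly positive vector.

```python
import numpy as np

def additive_update(U_new, S_new, V_new, phi_Tplus, phi_AO, phi_H, lam):
    """sum_t lam_t (U_new^T phi^{T+}_t) x phi^{AO}_t x (S_new^{-1} V_new^T phi^H_t)."""
    a = U_new.T @ phi_Tplus                   # (n, k): mode-1 projection
    c = (V_new.T @ phi_H) / S_new[:, None]    # (n, k): mode-3 projection, S^{-1} V^T phi^H
    # The diagonal weight tensor Lambda ties the three modes to the same index t.
    return np.einsum("t,it,jt,kt->ijk", lam, a, phi_AO, c)
```

Only the small projected tensor (n x m x n, with m observation features) is ever materialized, which is where the memory savings over the naive update come from.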


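Next we compute the updated tensor $\hat\Sigma_{\mathcal{AO},\mathcal{AO}|\mathcal{B}_{\mathrm{new}}}$:

$$\begin{aligned}
\hat\Sigma_{\mathcal{AO},\mathcal{AO}|\mathcal{B}_{\mathrm{new}}} &= \left( \hat\Sigma_{\mathcal{AO},\mathcal{H},\mathcal{AO}} + \Lambda \times_1 \phi^{\mathcal{AO}}_{\mathrm{new}} \times_2 \phi^{\mathcal{H}}_{\mathrm{new}} \times_3 \phi^{\mathcal{AO}}_{\mathrm{new}} \right) \times_2 \hat S^{-1}_{\mathrm{new}} \hat V^\top_{\mathrm{new}} \\
&= \hat\Sigma^{\mathcal{AO},\mathcal{AO}}_{\mathrm{update}} + \hat\Sigma^{\mathcal{AO},\mathcal{AO}}_{\mathrm{new}}
\end{aligned} \qquad (6.5)$$

where $\hat\Sigma^{\mathcal{AO},\mathcal{AO}}_{\mathrm{update}}$ can be viewed as the projection of $\hat\Sigma_{\mathcal{AO},\mathcal{AO}|\mathcal{B}}$ onto the new subspace:

$$\hat\Sigma^{\mathcal{AO},\mathcal{AO}}_{\mathrm{update}} = \hat\Sigma_{\mathcal{AO},\mathcal{H},\mathcal{AO}} \times_2 \hat S^{-1}_{\mathrm{new}} \hat V^\top_{\mathrm{new}}$$

and $\hat\Sigma^{\mathcal{AO},\mathcal{AO}}_{\mathrm{new}}$ is the projection of the additive update onto the new subspace:

$$\hat\Sigma^{\mathcal{AO},\mathcal{AO}}_{\mathrm{new}} = \Lambda \times_1 \left( \phi^{\mathcal{AO}}_{\mathrm{new}} \right) \times_2 \left( \hat S^{-1}_{\mathrm{new}} \hat V^\top_{\mathrm{new}} \phi^{\mathcal{H}}_{\mathrm{new}} \right) \times_3 \left( \phi^{\mathcal{AO}}_{\mathrm{new}} \right)$$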

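Once the updated estimates in Equation 6.4 and Equation 6.5 have been calculated, we can
compute the new PSR Bayes’ rule update as

$$b_{t+1} = \hat\Sigma_{\mathcal{B}^+,\mathcal{AO}|\mathcal{B}_{\mathrm{new}}} (b_t) \left( \hat\Sigma_{\mathcal{AO},\mathcal{AO}|\mathcal{B}_{\mathrm{new}}} (b_t) \right)^{-1} \phi^{\mathcal{AO}}_t \qquad (6.6)$$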
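A filtering step of this form can be sketched as two tensor contractions followed by a linear solve. The mode layout is an assumption made for illustration: the first tensor is taken to have modes (next belief, observation feature, current belief) and the second (observation feature, current belief, observation feature), following the mode on which the $\hat S^{-1} \hat V^\top$ projection acts in each case. The function name `psr_filter_step` is hypothetical.

```python
import numpy as np

def psr_filter_step(sigma_b_ao_b, sigma_ao_ao_b, b_t, phi_ao):
    """One Bayes' rule update: contract both tensors with b_t, then solve.

    sigma_b_ao_b: (n, m, n) with the current belief on the last mode (assumed).
    sigma_ao_ao_b: (m, n, m) with the current belief on the middle mode (assumed).
    """
    numer = np.einsum("ijk,k->ij", sigma_b_ao_b, b_t)    # (n, m)
    denom = np.einsum("ikj,k->ij", sigma_ao_ao_b, b_t)   # (m, m)
    # Solve denom x = phi_ao instead of forming the inverse explicitly.
    return numer @ np.linalg.solve(denom, phi_ao)
```

Using `numpy.linalg.solve` avoids an explicit matrix inverse; in practice a small ridge term may be needed if the contracted matrix is poorly conditioned.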
6.3.2 Online Updates

In the online setting (with just one sample per batch), the updates to $\hat S$ are rank-1, allowing some
additional efficiencies. We compute the rank-1 update to the matrices $\hat U$, $\hat S$, and $\hat V^\top$. We can
compute $C$ and $D$ efficiently via a simplified Gram-Schmidt step [20]:

$$\begin{aligned}
C &\leftarrow (I - \hat U \hat U^\top)\, \phi^{\mathcal{T}}_1, & C &\leftarrow C / \|C\| \\
D &\leftarrow (I - \hat V \hat V^\top)\, \phi^{\mathcal{H}}_1, & D &\leftarrow D / \|D\|
\end{aligned}$$

Finally, we can compute $K$ by adding a rank-1 matrix to $\hat S$:

$$K = \begin{bmatrix} \hat S & 0 \\ 0 & 0 \end{bmatrix} + \begin{bmatrix} \hat U^\top \\ C^\top \end{bmatrix} \phi^{\mathcal{T}}_1 \phi^{\mathcal{H}\top}_1 \begin{bmatrix} \hat V & D \end{bmatrix}$$


We  then  compute  the  updated  parameters  using  Eqs.  6.3d-f. 

The  online  update  incurs  significant  computational  cost  due  to  the  fact  that  we  must  compute 
a  SVD  at  each  time  step.  Therefore,  the  online  updates  are  not  worth  the  computation  time  unless 
new  parameters  are  truly  needed  after  each  observation.  By  contrast,  updating  the  parameters 
with  small  batches  of  new  information  provides  a  good  tradeoff  between  batch  and  online  updates 
when  the  efficient  batch  algorithm  is  computationally  intractable. 


6.4  Random  Projections  for  High  Dimensional  Feature  Spaces 

Despite  their  simplicity  and  wide  applicability,  HMMs,  POMDPs,  and  PSRs  are  limited  in  that 
they  are  usually  restricted  to  discrete  observations,  and  the  state  is  usually  restricted  to  have 
only moderate cardinality. In Chapter 4, we described a feature-based representation for PSRs
that  relaxes  this  restriction.  In  Chapter  5  and  Song  et  al.  [99]  we  proposed  a  spectral  learning 
algorithm  for  PSRs  and  HMMs  with  continuous  observations  by  representing  distributions  over 
these  observations  and  continuous  latent  states  as  embeddings  in  an  infinite  dimensional  Hilbert 
space.  These  Hilbert  Space  Embeddings  of  PSRs  (HSE-PSRs)  and  HMMs  (HSE-HMMs)  use 
essentially  the  same  framework  as  other  spectral  learning  algorithms  for  HMMs  and  PSRs,  but 
avoid  working  in  the  infinite-dimensional  Hilbert  space  by  the  well-known  “kernel  trick.” 

HSE-HMMs  have  been  shown  to  perform  well  on  several  real-world  datasets,  often  beating 
the  next  best  method  by  a  substantial  margin.  However,  they  scale  poorly  due  to  the  need  to  work 
with  the  kernel  matrix,  whose  size  is  quadratic  in  the  number  of  training  points. 

We  can  overcome  this  scaling  problem  and  learn  PSRs  that  approximate  HSE-HMMs  using 
random  features  for  kernel  machines  [80]:  we  construct  a  large  but  finite  set  of  random  features 
which  let  us  approximate  a  desired  kernel  using  ordinary  dot  products.  (Rahimi  and  Recht  show 
how  to  approximate  several  popular  kernels,  including  radial  basis  function  (RBF)  kernels  and 
Laplacian  kernels.)  The  benefit  of  random  features  is  that  we  can  use  fast  linear  methods  that  do 
not depend on the number of data points to approximate the original kernel machine.

Figure 6.1: A synthetic RR-HMM. (A.) The eigenvalues of the true transition matrix. (B.) RMS
error  in  the  nonzero  eigenvalues  of  the  estimated  transition  matrix  vs.  number  of  training  samples, 
averaged  over  10  trials.  The  error  steadily  decreases,  indicating  that  the  PSR  model  is  becoming 
more  accurate,  as  we  incorporate  more  training  data. 


HSE-HMMs are no exception: using random features of tests and histories, we can approximate
an HSE-HMM with a PSR. If we combine random features with the above online learning
algorithm, we can approximate an HSE-HMM very closely by using an extremely large number
of random features. Such a large set of features would overwhelm batch spectral learning algorithms,
but our online method can handle it, which lets us scale HSE-HMMs to training sets that
are orders of magnitude larger, or even to streaming datasets with an inexhaustible supply of
training data.
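For the Gaussian RBF kernel, the random features of Rahimi and Recht [80] take a particularly simple form: random projections drawn from a Gaussian, a random phase, and a cosine. The sketch below is a minimal illustration of that standard construction; the function name `random_fourier_features`, the one-sample-per-row layout, and the kernel parameterization $k(x, y) = \exp(-\|x - y\|^2 / (2\sigma^2))$ are assumptions made here.

```python
import numpy as np

def random_fourier_features(X, num_features, bandwidth, rng):
    """Map rows of X to features whose dot products approximate an RBF kernel.

    Approximates k(x, y) = exp(-||x - y||^2 / (2 * bandwidth**2)).
    """
    d = X.shape[1]
    W = rng.standard_normal((d, num_features)) / bandwidth   # frequencies ~ N(0, I / bandwidth^2)
    b = rng.uniform(0.0, 2.0 * np.pi, size=num_features)     # random phases
    return np.sqrt(2.0 / num_features) * np.cos(X @ W + b)
```

The same `W` and `b` must be reused for every data point (training and test), so in an online setting they would be drawn once and stored.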


6.5  Experimental  Results 

We  designed  3  sets  of  experiments  to  evaluate  the  statistical  properties  and  practical  potential  of 
our  online  spectral  learning  algorithm.  In  the  first  experiment  we  show  the  convergence  behavior 
of  the  algorithm.  In  the  second  experiment  we  show  how  online  spectral  learning  combined  with 
random  projections  can  be  used  to  learn  a  PSR  that  closely  approximates  the  performance  of  a 
HSE-HMM.  In  the  third  experiment  we  demonstrate  how  this  combination  allows  us  to  model  a 
high-bandwidth,  high-dimensional  video,  where  the  amount  of  training  data  would  overwhelm 
a  kernel-based  method  like  HSE-HMMs  and  the  number  of  features  would  overwhelm  a  PSR 
batch  learning  algorithm. 


6.5.1  A  Synthetic  Example 

First we demonstrate the convergence behavior of our algorithm on a difficult synthetic HMM
from Siddiqi et al. [90]. This HMM is 2-step observable, with 4 states, 2 observations, and a rank-3
transition matrix. (So, the HMM is reduced rank (an “RR-HMM”) and features of multiple
observations are required to disambiguate state.) The transition matrix T and the observation
matrix O are:

Figure 6.2: Slot car inertial measurement data. (A) The slot car platform: the car and IMU (top)
and the racetrack (bottom). (B) Squared error for prediction with different estimated models.
Dash-dot shows the baseline of simply predicting the mean measurement on all frames.

We sample observations from the true model and then estimate the model using the algorithm
of Section 6.3.2. Since we only expect to recover the transition matrix up to a similarity transform,
we compare the eigenvalues of $\hat B = \sum_o \hat B_o$ in the learned model to the eigenvalues of the
transition matrix T of the true model. Fig. 6.1 shows that the learned eigenvalues converge to the
true ones as the amount of data increases.

6.5.2  Slot  Car  Inertial  Measurement 

In a second experiment, we compare the online spectral algorithm with random features to HSE-HMMs
with Gaussian RBF kernels. The setup consisted of a track and a miniature car (1:32
scale)  guided  by  a  slot  cut  into  the  track  [99].  Figure  6.2(A)  shows  the  car  and  the  attached  IMU 
(an  Intel  Inertiadot),  as  well  as  the  14m  track,  which  contains  elevation  changes  and  banked 
curves.  We  collected  the  estimated  3D  acceleration  and  velocity  of  the  car  at  10Hz.  The  data 
consisted  of  3000  successive  measurements  while  the  slot  car  circled  the  track  controlled  by  a 
constant  policy.  The  goal  was  to  learn  a  model  of  the  noisy  IMU  data,  and,  after  filtering,  to 
predict  future  readings. 

We  trained  a  20-dimensional  HSE-HMM  using  the  algorithm  of  Song  et  al.,  with  tests  and 
histories  consisting  of  150  consecutive  observations.  We  set  the  bandwidth  parameter  of  the 
Gaussian  RBF  kernels  with  the  “median  trick,”  and  the  regularization  (ridge)  parameter  was 
$10^{-4}$. For details see Song et al. (2010).

Next we trained a 20-dimensional PSR with random Fourier features to approximate the
Gaussian  RBF  kernel.  We  generated  25000  features  for  the  tests  and  histories  and  400  features 
for  current  observations,  and  then  used  the  online  spectral  algorithm  to  learn  a  model.  Finally,  to 
provide  some  context,  we  learned  a  20-state  discrete  HMM  (with  400  levels  of  discretization  for 
observations)  using  the  Baum-Welch  EM  algorithm  run  until  convergence. 

For each model, we performed filtering for different extents $t_1 = 100, 101, \ldots, 250$, then
predicted the measurement a further $t_2 = 1, 2, \ldots, 100$ steps in the future. The squared
error of this prediction in the IMU’s measurement space was recorded, and averaged over all the
different filtering extents $t_1$ to obtain means which are plotted in Figure 6.2(B).

The  results  demonstrate  that  the  online  spectral  learning  algorithm  with  a  large  number  of 
random  Fourier  features  does  an  excellent  job  matching  the  performance  of  the  HSE-HMM,  and 
suggest  that  the  online  spectral  learning  algorithm  is  a  viable  alternative  to  HSE-HMMs  when 
the  amount  of  training  data  grows  large. 

6.5.3  Modeling  Video 

Next we look at the problem of mapping from video: we collected a sequence of 11,000 160 × 120
grayscale frames at 24 fps in an indoor environment (a camera circling a conference room, occasionally
switching directions; each full circuit took about 400 frames). This data was collected by
hand,  so  the  camera’s  trajectory  is  quite  noisy.  The  high  frame  rate  and  complexity  of  the  video 
mean  that  learning  an  accurate  model  requires  a  very  large  dataset.  Unfortunately,  a  dataset  of 
this  magnitude  makes  learning  an  HSE-HMM  difficult  or  impossible:  e.g.,  the  similar  but  less 
complex  example  of  Song  et  al.  used  only  1500  frames. 

Instead,  we  used  random  Fourier  features  and  an  online  PSR  to  approximate  a  HSE-HMM 
with  Gaussian  RBF  kernels.  We  used  tests  and  histories  based  on  400  sequential  frames  from 
the  video,  generated  100,000  random  features,  and  learned  a  50-dimensional  PSR.  To  duplicate 
this setup, the batch PSR algorithm would have to find the SVD of a 100,000 × 100,000 matrix;
by  contrast,  we  can  efficiently  update  our  parameters  by  incorporating  100,000-element  feature 
vectors  one  at  a  time  and  maintaining  50  x  50  and  50  x  100,000  matrices. 

Figure 6.3 shows our results. The final learned model does a surprisingly good job at capturing
the major features of this environment, including both the continuous location of the camera
and  the  discrete  direction  of  motion  (either  clockwise  or  counterclockwise).  Furthermore,  the 
fact  that  a  general-purpose  online  algorithm  learns  these  manifolds  is  a  powerful  result:  we  are 
essentially  performing  simultaneous  localization  and  mapping  in  a  difficult  loop  closing  scenario, 
without  any  prior  knowledge  (even,  say,  that  the  environment  is  three-dimensional,  or  whether 
the  sensor  is  a  camera,  a  laser  rangefinder,  or  something  else). 


6.6  Conclusions 

We presented spectral learning algorithms for PSR models of partially-observable nonlinear dynamical
systems. In particular, we showed how to update the parameters of a PSR given new
batches of data, and built on these updates to develop an efficient online spectral learning algorithm.
We also showed how to use random projections in conjunction with PSRs to efficiently
approximate HSE-HMMs. The combination of random projections and online updates allows us
to take advantage of powerful Hilbert space embeddings while handling training data sets that
are orders of magnitude larger than previous methods, and therefore, to learn models that are too
complex for previous methods.

Figure 6.3: Modeling video. (A.) Schematic of the camera’s environment. (B.) The second and
third dimension of the learned belief space (the first dimension contains normalization information).
Points are colored red when the camera is traveling clockwise and blue when traveling
counterclockwise. The learned state space separates into two manifolds, one for each direction,
connected at points where the camera changes direction. (The manifolds appear on top of one
another, but are separated in the fourth latent dimension.) (C.) Loop closing: estimated historical
camera positions after 100, 350, and 600 steps. Red star indicates current camera position. The
camera loops around the table, and the learned map “snaps” to the correct topology when the
camera passes its initial position.

Part II

Spectral Learning Algorithms in Practice

Chapter 7

Stability in Linear Dynamical Systems


A major difficulty in learning LDSs is that standard learning algorithms, both spectral and
likelihood-based, can result in models with unstable dynamics, which makes them
ill-suited for several important tasks such as simulation and long-term prediction. This problem
can arise even when the underlying dynamical system emitting the data is stable, particularly
if insufficient training data is available, which is often the case for high-dimensional temporal
sequences. In this chapter we propose an extension to subspace identification that constrains the
estimated parameters to be stable. Though stability is a non-convex constraint, we will see how a
constraint-generation-based optimization approach yields approximations to the optimal solution
that are more efficient and more accurate than previous state-of-the-art stabilizing methods.


7.1  Introduction 

We propose an optimization algorithm for learning the dynamics matrix of an LDS while guaranteeing
stability. We first obtain an estimate of the underlying state sequence using subspace
identification. We then formulate the least-squares minimization problem for the dynamics matrix
as a quadratic program (QP) [18], initially without constraints. When we solve this QP, the
estimate $\hat A$ we obtain may be unstable. However, any unstable solution allows us to derive a
linear constraint which we then add to our original QP and re-solve. This constraint is a conservative
approximation to the true feasible region. The above two steps are iterated until we reach
a stable solution, which is then refined by a simple interpolation to obtain the best possible stable
estimate. The overall algorithm is illustrated in Figure 7.1.

Our  method  can  be  viewed  as  constraint  generation  [40]  for  an  underlying  convex  program 
with  a  feasible  set  of  all  matrices  with  singular  values  at  most  1,  similar  to  work  in  control 
systems  such  as  [57].  This  convex  set  approximates  the  true,  non-convex  feasible  region.  So, 
we  terminate  before  reaching  feasibility  in  the  convex  program,  by  checking  for  matrix  stability 
after  each  new  constraint.  This  makes  our  algorithm  less  conservative  than  previous  methods 
for  enforcing  stability  since  it  chooses  the  best  of  a  larger  set  of  stable  dynamics  matrices.  The 
difference  in  the  resulting  stable  systems  is  noticeable  when  simulating  data.  The  constraint 
generation  approach  also  results  in  much  greater  efficiency  than  previous  methods  in  nearly  all 
cases. 




One  application  of  LDSs  in  computer  vision  is  learning  dynamic  textures  from  video  data  [96]. 
An advantage of learning dynamic textures is the ability to play back a realistic-looking generated
sequence of any desired duration. In practice, however, videos synthesized from dynamic
texture  models  can  quickly  become  degenerate  because  of  instability  in  the  underlying  LDS.  In 
contrast,  sequences  generated  from  dynamic  textures  learned  by  our  method  remain  sane  even 
after  arbitrarily  long  durations.  We  also  apply  our  algorithm  to  learning  baseline  dynamic  models 
of  over-the-counter  (OTC)  drug  sales  for  biosurveillance,  and  sunspot  numbers  from  the  UCR 
archive  [54].  Comparison  to  the  best  alternative  methods  [57]  on  these  problems  yields  positive 
results. 


7.2  Learning  Stable  Linear  Dynamical  Systems 

The spectral algorithm in Section 3.4 does not enforce stability in $\hat A$, which can cause problems
when predicting and simulating from an LDS learned from data. To account for stability, we first
formulate the dynamics matrix learning problem as a quadratic program with a feasible set that
includes the set of stable dynamics matrices. Then we demonstrate how instability in its solutions
can be used to generate linear constraints that are added to the QP to restrict this feasible set
appropriately. As a final step, the solution is refined to be as close as possible to the least-squares
estimate while remaining stable. We compare our algorithm to two previous approaches
from control theory, LB-1 and LB-2, in our experimental results [57, 58].¹ The overall algorithm
is illustrated in Figure 7.1(A). We now explain the algorithm in more detail.

7.2.1  Formulating  the  Objective 

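In our spectral learning algorithm for Kalman filters, it is possible to write the objective function
for A as a quadratic function. For subspace ID we define a quadratic objective function:

$$\begin{aligned}
\hat A &= \arg\min_A \left\| A \hat\Sigma_{\mathcal{F},\mathcal{H}} - \hat\Sigma_{\mathcal{F}^+,\mathcal{H}} \right\|_F^2 \\
&= \arg\min_A \left\{ \mathrm{tr}\left[ \left( A \hat\Sigma_{\mathcal{F},\mathcal{H}} - \hat\Sigma_{\mathcal{F}^+,\mathcal{H}} \right)^\top \left( A \hat\Sigma_{\mathcal{F},\mathcal{H}} - \hat\Sigma_{\mathcal{F}^+,\mathcal{H}} \right) \right] \right\} \\
&= \arg\min_A \left\{ \mathrm{tr}\left( A \hat\Sigma_{\mathcal{F},\mathcal{H}} \hat\Sigma_{\mathcal{F},\mathcal{H}}^\top A^\top \right) - 2\, \mathrm{tr}\left( \hat\Sigma_{\mathcal{F},\mathcal{H}} \hat\Sigma_{\mathcal{F}^+,\mathcal{H}}^\top A \right) + \mathrm{tr}\left( \hat\Sigma_{\mathcal{F}^+,\mathcal{H}}^\top \hat\Sigma_{\mathcal{F}^+,\mathcal{H}} \right) \right\} \\
&= \arg\min_a \left\{ a^\top P a - 2 q^\top a + r \right\}
\end{aligned} \qquad (7.1a)$$

where $a \in \mathbb{R}^{n^2 \times 1}$, $q \in \mathbb{R}^{n^2 \times 1}$, $P \in \mathbb{R}^{n^2 \times n^2}$ and $r \in \mathbb{R}$ are defined as:

$$a = \mathrm{vec}(A) = [A_{11}\; A_{21}\; A_{31}\; \cdots\; A_{nn}]^\top \qquad (7.1b)$$

$$P = I_n \otimes \left( \hat\Sigma_{\mathcal{F},\mathcal{H}} \hat\Sigma_{\mathcal{F},\mathcal{H}}^\top \right) \qquad (7.1c)$$

$$q = \mathrm{vec}\left( \hat\Sigma_{\mathcal{F},\mathcal{H}} \hat\Sigma_{\mathcal{F}^+,\mathcal{H}}^\top \right) \qquad (7.1d)$$

$$r = \mathrm{tr}\left( \hat\Sigma_{\mathcal{F}^+,\mathcal{H}}^\top \hat\Sigma_{\mathcal{F}^+,\mathcal{H}} \right) \qquad (7.1e)$$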

¹ For detailed comparisons between our algorithm and these approaches, see [89].

Figure 7.1: (A): Conceptual depiction of the space of $n \times n$ matrices. The region of stability ($S_\lambda$)
is non-convex while the smaller region of matrices with $\sigma_1 \le 1$ ($S_\sigma$) is convex. The elliptical
contours indicate level sets of the quadratic objective function of the QP. $\hat A$ is the unconstrained
least-squares solution to this objective. $A_{\mathrm{LB\text{-}1}}$ is the solution found by LB-1 [57]. One iteration
of constraint generation yields the constraint indicated by the line labeled ‘generated constraint’,
and (in this case) leads to a stable solution $A^*$. The final step of our algorithm improves on this
solution by interpolating $A^*$ with the previous solution (in this case, $\hat A$) to obtain $A^*_{\mathrm{final}}$. (B):
The actual stable and unstable regions for the space of 2 × 2 matrices $E_{\alpha,\beta} = [\,0.3 \;\; \alpha \,;\, \beta \;\; 0.3\,]$,
with $\alpha, \beta \in [-10, 10]$. Constraint generation is able to learn a nearly optimal model from a
noisy state sequence of length 7 simulated from $E_{0,10}$, with better state reconstruction error than
either LB-1 or LB-2. The matrices $E_{10,0}$ and $E_{0,10}$ are stable, but their convex combination
$E_{5,5} = 0.5 E_{10,0} + (1 - 0.5) E_{0,10}$ is unstable.

$I_n$ is the $n \times n$ identity matrix and $\otimes$ denotes the Kronecker product. Note that P (which is not
related to the P in Section 2.3.2) is a symmetric positive semi-definite matrix and the objective
function in Equation 7.1a is a quadratic function of a.
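The quadratic form can be checked numerically. In the sketch below, `X` and `Y` are hypothetical stand-ins for the two moment matrices in the objective $\|AX - Y\|_F^2$, and `a` is taken to be the column-stacked vectorization of A. Note that the arrangement of the Kronecker factors depends on the stacking convention: with columns stacked, as assumed here, the terms come out as $P = (XX^\top) \otimes I_n$ and $q = \mathrm{vec}(YX^\top)$, while stacking the rows of A instead gives the transposed arrangement $I_n \otimes (XX^\top)$ and $\mathrm{vec}(XY^\top)$.

```python
import numpy as np

def qp_terms(X, Y):
    """Terms P, q, r with ||A X - Y||_F^2 = a^T P a - 2 q^T a + r.

    Assumes a = vec(A) with columns stacked (Fortran order in NumPy).
    """
    n = X.shape[0]
    P = np.kron(X @ X.T, np.eye(n))      # symmetric positive semi-definite
    q = (Y @ X.T).flatten(order="F")     # vec(Y X^T), columns stacked
    r = np.trace(Y.T @ Y)
    return P, q, r
```

Setting the gradient to zero gives P a = q, which recovers the unconstrained least-squares solution; this is a convenient sanity check before adding any constraints.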


7.2.2  Generating  Constraints 

The feasible set of the quadratic objective function is the space of all $n \times n$ matrices, regardless of
their stability. When its solution yields an unstable matrix, the spectral radius of $\hat A$ (i.e. $|\lambda_1(\hat A)|$)
is greater than 1. Ideally we would like to use $\hat A$ to calculate a convex constraint on the spectral
radius. However, consider the class of 2 × 2 matrices: $E_{\alpha,\beta} = [\,0.3 \;\; \alpha \,;\, \beta \;\; 0.3\,]$ [71]. The matrices
$E_{10,0}$ and $E_{0,10}$ are stable with $\lambda_1 = 0.3$, but their convex combination $\gamma E_{10,0} + (1 - \gamma) E_{0,10}$
is unstable for (e.g.) $\gamma = 0.5$ (Figure 7.1(B)). This shows that the set of stable matrices is non-convex
for $n = 2$, and in fact this is true for all $n > 1$. We turn instead to the largest singular
value, which is a closely related quantity since

$$\sigma_{\min}(\hat A) \le |\lambda_i(\hat A)| \le \sigma_{\max}(\hat A) \quad \forall i = 1, \ldots, n \quad [39]$$




Therefore every unstable matrix has a singular value greater than one, but the converse is not
necessarily true. Moreover, the set of matrices with $\sigma_1 \le 1$ is convex.² Figure 7.1(A) conceptually
depicts the non-convex region of stability $S_\lambda$ and the convex region $S_\sigma$ with $\sigma_1 \le 1$ in the
space of all $n \times n$ matrices for some fixed n. The difference between $S_\sigma$ and $S_\lambda$ can be significant.
Figure 7.1(B) depicts these regions for $E_{\alpha,\beta}$ with $\alpha, \beta \in [-10, 10]$. The stable matrices
$E_{10,0}$ and $E_{0,10}$ reside at the edges of the figure. While results for this class of matrices vary
based on the instance used, the constraint generation algorithm described below is able to learn
a nearly optimal model from a noisy state sequence of length $\tau = 7$ simulated from $E_{0,10}$, with better
state reconstruction error than LB-1 and LB-2.
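The two-by-two family above makes both points easy to verify numerically: each endpoint is stable although its largest singular value is far above 1, and the midpoint of the two is unstable. The helper names `E` and `spectral_radius` in this small check are chosen here for illustration.

```python
import numpy as np

def E(alpha, beta):
    """The 2 x 2 family [0.3 alpha; beta 0.3] from the text."""
    return np.array([[0.3, alpha], [beta, 0.3]])

def spectral_radius(A):
    """Largest eigenvalue magnitude; A is stable when this is below 1."""
    return np.max(np.abs(np.linalg.eigvals(A)))

midpoint = 0.5 * E(10.0, 0.0) + 0.5 * E(0.0, 10.0)   # equals E(5, 5)
print(spectral_radius(E(10.0, 0.0)) < 1.0)           # endpoint is stable
print(np.linalg.norm(E(10.0, 0.0), 2) > 1.0)         # yet lies outside the sigma_1 <= 1 set
print(spectral_radius(midpoint) > 1.0)               # convex combination is unstable
```

The midpoint is symmetric with eigenvalues 0.3 ± 5, so its spectral radius is 5.3.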

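Let $\hat A = \tilde U \tilde\Sigma \tilde V^\top$ by SVD, where $\tilde U = [\tilde u_i]_{i=1}^n$ and $\tilde V = [\tilde v_i]_{i=1}^n$ and $\tilde\Sigma = \mathrm{diag}\{\tilde\sigma_1, \ldots, \tilde\sigma_n\}$.
Then:

$$\hat A = \tilde U \tilde\Sigma \tilde V^\top \;\Rightarrow\; \tilde\Sigma = \tilde U^\top \hat A \tilde V \;\Rightarrow\; \tilde\sigma_1(\hat A) = \tilde u_1^\top \hat A \tilde v_1 = \mathrm{tr}(\tilde u_1^\top \hat A \tilde v_1) \qquad (7.2)$$

Therefore, instability of $\hat A$ implies that:

$$\tilde\sigma_1 > 1 \;\Rightarrow\; \mathrm{tr}\left( \tilde u_1^\top \hat A \tilde v_1 \right) > 1 \;\Rightarrow\; \mathrm{tr}\left( \tilde v_1 \tilde u_1^\top \hat A \right) > 1 \;\Rightarrow\; g^\top \hat a > 1 \qquad (7.3)$$

Here $g = \mathrm{vec}(\tilde u_1 \tilde v_1^\top)$. Since Eq. (7.3) arose from an unstable solution of Eq. (7.1a), g is a
hyperplane separating $\hat a$ from the space of matrices with $\sigma_1 \le 1$. We use the negation of Eq. (7.3)
as a constraint:

$$g^\top a \le 1 \qquad (7.4)$$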
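Generating one such constraint requires only the leading singular vectors of the current solution. The sketch below assumes the same column-stacked vectorization for both g and a; the function name `stability_constraint` is hypothetical.

```python
import numpy as np

def stability_constraint(A):
    """Return g = vec(u1 v1^T) and sigma_1 for the current solution A.

    With a = vec(A) (columns stacked), g @ a equals sigma_1(A), so the
    linear constraint g @ a <= 1 cuts off A whenever sigma_1(A) > 1.
    """
    U, s, Vt = np.linalg.svd(A)
    g = np.outer(U[:, 0], Vt[0]).flatten(order="F")
    return g, s[0]
```

Every matrix M with largest singular value at most 1 satisfies the constraint, because $u_1^\top M v_1 \le \sigma_1(M)$ for unit vectors; this is why the cut never removes any point of the convex set.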


7.2.3  Computing  the  Solution 

The  overall  quadratic  program  can  be  stated  as: 

minimize $a^\top P a - 2 q^\top a + r$
subject to $G a \le h$

with a, P, q and r as defined in Eqs. (7.1b)-(7.1e). {G, h} define the set of constraints, and are initially
empty. The QP is invoked repeatedly until the stable region, i.e. $S_\lambda$, is reached. At each iteration,
we calculate a linear constraint of the form in Eq. (7.4), add the corresponding $g^\top$ as a row in G,
and augment h with 1. Note that we will almost always stop before reaching the feasible region
$S_\sigma$.

7.2.4  Refinement 

Once a stable matrix is obtained, it is possible to refine this solution. We know that the last
constraint caused our solution to cross the boundary of $S_\lambda$, so we interpolate between the last
solution and the previous iteration’s solution using binary search to look for a boundary of the
stable region, in order to obtain a better objective value while remaining stable. This results in a
stable matrix with top eigenvalue slightly less than 1. In principle, such an interpolation could be
attempted between the least squares solution and any stable solution. However, the stable region
can be highly complex, and there may be several folds and boundaries of the stable region in
the interpolated area. In our experiments (not shown), interpolating from the solutions given by
LB-1 and LB-2 yielded worse results.

² Since $\sigma_1(M) = \max_{u,v:\, \|u\|_2 = 1,\, \|v\|_2 = 1} u^\top M v$, if $\sigma_1(M_1) \le 1$ and $\sigma_1(M_2) \le 1$, then for all convex
combinations, $\sigma_1(\gamma M_1 + (1 - \gamma) M_2) = \max_{u,v:\, \|u\|_2 = 1,\, \|v\|_2 = 1} \gamma u^\top M_1 v + (1 - \gamma) u^\top M_2 v \le 1$.
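The interpolation step of Section 7.2.4 can be sketched as a bisection on the mixing weight between the stable solution and the previous, unstable one. This is an illustrative sketch: the function names `is_stable` and `refine` and the fixed iteration count are assumptions, and, as the text cautions, bisection finds only one boundary crossing along the segment, not necessarily the one closest to the unstable end.

```python
import numpy as np

def is_stable(A):
    """True when the spectral radius of A is strictly below 1."""
    return np.max(np.abs(np.linalg.eigvals(A))) < 1.0

def refine(A_stable, A_prev, iters=40):
    """Bisect on t in [0, 1] over A_stable + t * (A_prev - A_stable).

    Assumes A_stable is stable and A_prev is not. The lower end of the
    bracket is always a stable point, so the returned matrix is stable.
    """
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if is_stable(A_stable + mid * (A_prev - A_stable)):
            lo = mid          # move toward the better-fitting previous solution
        else:
            hi = mid
    return A_stable + lo * (A_prev - A_stable)
```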


7.3  Experiments 

For  learning  the  dynamics  matrix,  we  implemented  subspace  identification,  constraint  generation 
(using  quadprog),  LB-1  [57]  and  LB-2  [58]  (using  CVX  with  SeDuMi)  in  Matlab  on  a  3.2 
GHz  Pentium  with  2  GB  RAM.  Note  that  the  algorithms  that  constrain  the  solution  to  be  stable 
give  a  different  result  from  the  basic  subspace  ID  algorithm  only  in  situations  when  the  learned 
A  is  unstable.  However,  LDSs  learned  in  scarce-data  scenarios  are  unstable  for  almost  any 
domain,  and  some  domains  lead  to  unstable  models  up  to  the  limit  of  available  data  (e.g.  the 
steam  dynamic  textures  in  Section  7.3.1).  The  goals  of  our  experiments  are  to:  (1)  examine 
the  state  evolution  and  simulated  observations  of  models  learned  using  constraint  generation, 
and  compare  them  to  previous  work  on  learning  stable  dynamical  systems;  and  (2)  compare 
the  algorithms  in  terms  of  predictive  accuracy  and  computational  efficiency.  We  apply  these 
algorithms  to  learning  dynamic  textures  from  the  vision  domain  (Section  7.3.1)  as  well  as  OTC 
drug  sales  counts  (Section  7.3.2)  and  sunspot  numbers  (Section  7.3.3). 

7.3.1  Stable  Dynamic  Textures 

Dynamic  textures  in  vision  can  intuitively  be  described  as  models  for  sequences  of  images  that 
exhibit  some  form  of  low-dimensional  structure  and  recurrent  (though  not  necessarily  repeating) 
characteristics,  e.g.,  fixed-background  videos  of  rising  smoke  or  flowing  water.  Treating  each 
frame of a video as an observation vector of pixel values $y_t$, we learned dynamic texture models
of two video sequences by subspace identification: the steam sequence, composed of 120 × 170
pixel images, and the fountain sequence, composed of 150 × 90 pixel images, both of
which originated from the MIT temporal texture database (Figure 7.2(A)). We use the following
parameters: training data size r = 80, number of latent state dimensions n = 15, and number of
past and future observations in the Hankel matrix i = 5. Note that, while the observations are the
raw pixel values, the underlying state sequence we learn has no a priori interpretation. We can,
however,  interpret  the  underlying  state  space  as  a  set  of  predicted  observations;  that  is,  states  are 
coefficients  for  the  learned  observation  basis. 

An LDS model of a dynamic texture may synthesize an arbitrarily long sequence of images
by driving the model with zero mean Gaussian noise. Each of our two models uses an 80 frame
training sequence to generate 1000 sequential images in this way. To better visualize the difference
between image sequences generated by least-squares, LB-1, and constraint generation, the
evolution of each method's state is plotted over the course of the synthesized sequences
(Figure 7.2(B)). Sequences generated by the least squares models appear to be unstable, and this
was in fact the case; both the steam and the fountain sequences resulted in unstable dynamics
matrices. Conversely, the constrained subspace identification algorithms all produced
well-behaved sequences of states and stable dynamics matrices (Table 7.1), although constraint
generation demonstrates the fastest runtime, best scalability, and lowest error of any
stability-enforcing approach.

Figure 7.2: Dynamic textures. A. Samples from the original steam sequence and the
fountain sequence. B. State evolution of synthesized sequences over 1000 frames (steam
top, fountain bottom). The least squares solutions display instability as time progresses. The
solutions obtained using LB-1 remain stable for the full 1000 frame image sequence. The
constraint generation solutions, however, yield state sequences that are stable over the full 1000
frame image sequence without significant damping. C. Samples drawn from a least squares
synthesized sequence (top), and samples drawn from a constraint generation synthesized sequence
(bottom). Here we are displaying image values that are clipped to stay within the valid pixel
range [0,255]. Images for LB-1 are not shown. The constraint generation synthesized steam
sequence is qualitatively better looking than the steam sequence generated by LB-1, although
there is little qualitative difference between the two synthesized fountain sequences.
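
The effect of stability on synthesis can be illustrated with a small simulation. The sketch below assumes the usual LDS form $x_{t+1} = A x_t + w_t$, $y_t = C x_t$ with observation noise omitted; the matrices are hypothetical 2 x 2 scaled rotations rather than learned dynamic-texture models, and all function names are illustrative.

```python
import math
import random

def mat_vec(M, v):
    """Matrix-vector product for lists of lists."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def scaled_rotation(scale, angle):
    """2x2 rotation by `angle` radians, scaled so its spectral radius equals `scale`."""
    c, s = math.cos(angle), math.sin(angle)
    return [[scale * c, -scale * s], [scale * s, scale * c]]

def synthesize(A, C, x0, steps, noise_std=0.0, seed=0):
    """Simulate x_{t+1} = A x_t + w_t and y_t = C x_t, with w_t ~ N(0, noise_std^2 I)."""
    rng = random.Random(seed)
    x, states, frames = list(x0), [], []
    for _ in range(steps):
        states.append(x)
        frames.append(mat_vec(C, x))
        x = [xi + rng.gauss(0.0, noise_std) for xi in mat_vec(A, x)]
    return states, frames

def norm(v):
    return math.sqrt(sum(x * x for x in v))

C = [[1.0, 0.0], [0.0, 1.0]]
stable_states, stable_frames = synthesize(scaled_rotation(0.99, 0.1), C, [1.0, 0.0], 1000)
unstable_states, _ = synthesize(scaled_rotation(1.01, 0.1), C, [1.0, 0.0], 1000)
```

With the noise switched off, the state norm of the first model decays over the 1000 steps while that of the second grows by several orders of magnitude, which is the qualitative behaviour that the least squares models exhibit in Figure 7.2(B).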

A qualitative comparison of images generated by constraint generation and least squares
(Figure 7.2(C)) indicates the effect of instability in synthesized sequences generated from
dynamic texture models. While the unstable least-squares model demonstrates a dramatic increase
in image contrast over time, the constraint generation model continues to generate qualitatively
reasonable images. Qualitative comparisons between constraint generation and LB-1 indicate
that constraint generation learns models that generate more natural-looking video sequences³
than LB-1.

                             steam                                      fountain
            CG         LB-1       LB-1*      LB-2       CG         LB-1       LB-1*      LB-2
n = 10
|λ_1|       1.000      0.993      0.993      1.000      0.999      0.987      0.987      0.997
σ_1         1.036      1.000      1.000      1.034      1.051      1.000      1.000      1.054
e_x (%)     45.2       103.3      103.3      546.9      0.1        4.1        4.1        3.0
time        0.45       95.87      3.77       0.50       0.15       15.43      1.09       0.49
n = 20
|λ_1|       0.999      -          0.990      0.999      0.999      -          0.988      0.996
σ_1         1.037      -          1.000      1.062      1.054      -          1.000      1.056
e_x (%)     58.4       -          154.7      294.8      1.2        -          5.0        22.3
time        2.37       -          1259.6     33.55      1.63       -          159.85     5.13
n = 30
|λ_1|       1.000      -          0.988      1.000      1.000      -          0.993      0.998
σ_1         1.054      -          1.000      1.130      1.030      -          1.000      1.179
e_x (%)     63.0       -          341.3      631.5      13.3       -          14.9       104.8
time        8.72       -          23978.9    62.44      12.68      -          5038.94    48.55
n = 40
|λ_1|       1.000      -          0.989      1.000      1.000      -          0.991      1.000
σ_1         1.120      -          1.000      1.128      1.034      -          1.000      1.172
e_x (%)     20.24      -          282.7      768.5      3.3        -          4.8        21.5
time        5.85       -          79516.98   289.79     61.9       -          43457.77   239.53

Table 7.1: Quantitative results on the dynamic textures data for different numbers of states n. CG
is our algorithm, LB-1 and LB-2 are competing algorithms, and LB-1* is a simulation of LB-1
using our algorithm by generating constraints until we reach $S_\sigma$, since LB-1 failed for n > 10 due
to memory limits. $e_x$ is percent difference in squared reconstruction error. Constraint generation,
in all cases, has lower error and faster runtime.

Given  the  paucity  of  data  available  when  modeling  dynamic  textures,  it  is  not  possible  to  test 
the  long-range  predictive  power  of  the  learned  dynamical  systems  quantitatively.  Instead,  the 
error  metric  used  for  the  quantitative  experiments  when  evaluating  matrix  A*  is 

    $e_x(A^*) = 100\% \times \left( J^2(A^*) - J^2(\hat{A}) \right) / J^2(\hat{A})$    (7.6)

i.e. percent increase in squared reconstruction error compared to least squares, with $J(\cdot)$ as
defined in Eq. (7.1a). Table 7.1 demonstrates that constraint generation always has the lowest error
as well as the fastest runtime. The running time of constraint generation depends on the number
of constraints needed to reach a stable solution. Note that LB-1 is more efficient and scalable
when simulated using constraint generation (by adding constraints until $S_\sigma$ is reached) than it is
in its original SDP formulation.
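
As a worked illustration of the error metric in Eq. (7.6), the sketch below computes the percent increase in squared reconstruction error on a toy one-dimensional state sequence. It assumes that the squared reconstruction error is the sum over t of $\|A x_t - x_{t+1}\|^2$ for an estimated state sequence; Eq. (7.1a) is the authoritative definition, and the function names and data here are hypothetical.

```python
def sq_recon_error(A, X):
    """Sum over t of ||A x_t - x_{t+1}||^2 for a state sequence X (a list of vectors)."""
    total = 0.0
    for x, x_next in zip(X, X[1:]):
        pred = [sum(a * xi for a, xi in zip(row, x)) for row in A]
        total += sum((p - q) ** 2 for p, q in zip(pred, x_next))
    return total

def percent_error_increase(A_star, A_ls, X):
    """Percent increase in squared reconstruction error of A_star over least squares A_ls."""
    j_ls = sq_recon_error(A_ls, X)
    return 100.0 * (sq_recon_error(A_star, X) - j_ls) / j_ls

# Toy 1-D sequence 1, 2, 3: the least squares fit is a = (1*2 + 2*3) / (1 + 4) = 1.6,
# which is unstable; a = 1.0 is a hypothetical stabilized estimate.
X = [[1.0], [2.0], [3.0]]
e_x = percent_error_increase([[1.0]], [[1.6]], X)
```

Here the least squares error is 0.2 and the error of the stabilized estimate is 2.0, so the metric evaluates to approximately 900%.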

3 See videos at http://www.select.cs.cmu.edu/projects/stableLDS

Figure 7.3: Bar graphs illustrating decreases in objective function value relative to the least
squares  solution  (A,B)  and  the  running  times  (C,D)  for  different  stable  LDS  learning  algorithms 
on  the  fountain  and  steam  textures  respectively,  based  on  the  corresponding  columns  of 
Table  7.1. 

7.3.2  Stable  Baseline  Models  for  Biosurveillance 

We  examine  daily  counts  of  OTC  drug  sales  in  pharmacies,  obtained  from  the  National  Data 
Retail  Monitor  (NDRM)  collection  [118].  The  counts  are  divided  into  23  different  categories  and 
are  tracked  separately  for  each  zipcode  in  the  country.  We  focus  on  zipcodes  from  a  particular 
American  city  (not  identified  here  due  to  data  privacy  restrictions).  The  data  exhibits  7-day 
periodicity due to differential buying patterns during weekdays and weekends. We isolate a 60-day
subsequence where the data dynamics remain relatively stationary, and attempt to learn LDS
parameters  to  be  able  to  simulate  sequences  of  baseline  values  for  use  in  detecting  anomalies. 

We perform two experiments on different aggregations of the OTC data. We use the following
parameters: number of latent state dimensions n = 7, number of past and future observations in
the Hankel matrix i = 4, and training data size r = 14.
Figure 7.4(A) plots 22 different drug categories aggregated over all zipcodes, and Figure 7.4(B)
plots a single drug category (cough/cold) in 29 different zipcodes separately. In both cases,
constraint generation is able to use very little training data to learn a stable model that captures
the periodicity in the data, while the least squares model is unstable and its predictions diverge
over time. LB-1 learns a model that is stable but overconstrained, and the simulated observations
quickly drift from the correct magnitudes. Further details can be found in [89].

Figure 7.4: (A) Multi-drug sales counts: 60 days of data for 22 drug categories aggregated over
all zipcodes in the city. (B) Multi-zipcode sales counts: 60 days of data for a single drug category
(cough/cold) for all 29 zipcodes in the city. (C) Sunspot numbers for 200 years separately for each
of the 12 months. Each panel shows, from top to bottom, the training data, simulated output from
constraint generation, output from the unstable least squares model, and output from the
over-damped LB-1 model.


7.3.3  Modeling  Sunspot  Numbers 

We  compared  least  squares  and  constraint  generation  on  learning  LDS  models  for  the  sunspot 
data  discussed  earlier.  We  use  the  following  parameters:  number  of  latent  state  dimensions 
n = 7, number of past and future observations in the Hankel matrix i = 9, and training data size
r  =  50.  Figure  7.4(C)  represents  a  data-poor  training  scenario  where  we  train  a  least-squares 
model  on  18  timesteps,  yielding  an  unstable  model  whose  simulated  observations  increase  in 
amplitude  steadily  over  time.  Again,  constraint  generation  is  able  to  use  very  little  training 
data  to  learn  a  stable  model  that  seems  to  capture  the  periodicity  in  the  data  if  not  the  magnitude, 
while  the  least  squares  model  is  unstable.  The  model  learned  by  LB-1  attenuates  more  noticeably, 
capturing  the  periodicity  to  a  smaller  extent.  Quantitative  results  on  both  these  domains  exhibit 
similar  trends  as  those  in  Table  7.1. 


7.4  Related  Work 

Linear system identification is a well-studied subject [62]. Within this area, subspace identification
methods ([117], Chapter 2) have been very successful. These techniques first estimate the
model dimensionality and the underlying state sequence, and then derive parameter estimates using
least squares. Within subspace methods, techniques have been developed to enforce stability
by augmenting the extended observability matrix with zeros [24] or adding a regularization term
to the least squares objective [116].

All previous methods were outperformed by LB-1 [57], whose authors formulate the problem as a
semidefinite program (SDP) whose objective minimizes the state sequence reconstruction error,
and whose constraint bounds the largest singular value by 1. This convex constraint is obtained
by rewriting the nonlinear matrix inequality $I_n - AA^\top \succeq 0$ as a linear matrix inequality⁴, where
$I_n$ is the n × n identity matrix. Here, $\succ 0$ ($\succeq 0$) denotes positive (semi-)definiteness. The
existence of this constraint also proves the convexity of the $\sigma_1 \le 1$ region. This condition is
sufficient but not necessary for stability, since a matrix that violates this condition may still be stable.
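
A small numerical example makes the gap between the two conditions concrete. The sketch below uses a hypothetical 2 x 2 matrix (not one from the experiments) and closed-form 2 x 2 expressions for the spectral radius and the top singular value; the function names are illustrative.

```python
import cmath
import math

def spectral_radius_2x2(A):
    """Largest eigenvalue magnitude of a 2x2 matrix."""
    tr = A[0][0] + A[1][1]
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    disc = cmath.sqrt(tr * tr - 4.0 * det)
    return max(abs((tr + disc) / 2.0), abs((tr - disc) / 2.0))

def top_singular_value_2x2(A):
    """sigma_1(A), computed as the square root of the largest eigenvalue of A A^T."""
    (a, b), (c, d) = A
    p = a * a + b * b   # (A A^T)[0][0]
    q = a * c + b * d   # (A A^T)[0][1] = (A A^T)[1][0]
    r = c * c + d * d   # (A A^T)[1][1]
    lam_max = 0.5 * (p + r) + math.sqrt(0.25 * (p - r) ** 2 + q * q)
    return math.sqrt(lam_max)

# Both eigenvalues are 0.5, so A is stable, yet its top singular value exceeds 2.
A = [[0.5, 2.0], [0.0, 0.5]]
rho = spectral_radius_2x2(A)
sigma_1 = top_singular_value_2x2(A)
```

This matrix lies inside the set of stable matrices but outside the $\sigma_1 \le 1$ region, so a method constrained to that region could never return it.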

A follow-up to this work by the same authors [58], which we call LB-2, attempts to overcome
the conservativeness of LB-1 by approximating the Lyapunov inequalities $P - APA^\top \succ 0$,
$P \succ 0$ with the inequalities $P - APA^\top - \delta I_n \succeq 0$, $P - \delta I_n \succeq 0$, $\delta > 0$. These inequalities hold
iff the spectral radius is less than 1.⁵ However, the approximation is achieved only at the cost
of inducing a nonlinear distortion of the objective function by a problem-dependent reweighting
matrix involving P, which is a variable to be optimized. In our experiments, this causes LB-2 to
perform worse than LB-1 (for any $\delta$) in terms of the state sequence reconstruction error (dynamic
textures) and predictive log-likelihood (robot sensor data), even while obtaining solutions outside
the feasible region of LB-1. Consequently, we focus on LB-1 in our conceptual and qualitative
comparisons as it is the strongest baseline available. However, LB-2 is more scalable than LB-1,
so quantitative results are presented for both.

To  summarize  the  distinction  between  constraint  generation,  LB-1  and  LB-2:  it  is  hard  to 
have  both  the  right  objective  function  (reconstruction  error)  and  the  right  feasible  region  (the  set 
of  stable  matrices).  LB-1  optimizes  the  right  objective  but  over  the  wrong  feasible  region  (the 
set of matrices with $\sigma_1 \le 1$). LB-2 has a feasible region close to the right one, but at the cost of
distorting  its  objective  function  to  an  extent  that  it  fares  worse  than  LB-1  in  nearly  all  cases.  In 
contrast,  our  method  optimizes  the  right  objective  over  a  less  conservative  feasible  region  than 
that  of  any  previous  algorithm  with  the  right  objective,  and  this  combination  is  shown  to  work 
the  best  in  practice. 


4 This bounds the top singular value by 1 since it implies $\forall x\; x^\top (I_n - AA^\top) x \ge 0 \Rightarrow \forall x\; x^\top AA^\top x \le x^\top x \Rightarrow$
for $\nu = \nu_1(AA^\top)$ and $\lambda = \lambda_1(AA^\top)$, $\nu^\top AA^\top \nu \le \nu^\top \nu \Rightarrow \nu^\top \lambda \nu \le 1 \Rightarrow \sigma_1^2(A) \le 1$, since $\nu^\top \nu = 1$ and
$\sigma_1^2(M) = \lambda_1(MM^\top)$ for any square matrix M.

5 For a proof sketch, see Horn and Johnson [39], p. 410.

Chapter 8


Reinforcement  Learning:  Value  Iteration 
in  a  Learned  Predictive  State 
Representation 

8.1  Introduction 

In  this  chapter  we  shift  our  focus  from  learning  models  of  linear  dynamical  systems  to  planning 
in  predictive  state  representations.  Planning  a  sequence  of  actions  or  a  policy  to  maximize  reward 
has  long  been  considered  a  fundamental  problem  for  autonomous  agents.  In  the  hardest  version 
of  the  problem,  an  agent  must  form  a  plan  based  solely  on  its  own  experience,  without  the  aid 
of  a  human  engineer  who  can  design  problem-specific  models,  features  or  heuristics;  it  is  this 
version  of  the  problem  which  we  must  solve  to  build  a  truly  autonomous  agent. 

Partially Observable Markov Decision Processes (POMDPs) [22, 97] are a general framework
for single-agent planning. POMDPs model the state of the world as a latent variable
and explicitly reason about uncertainty in both action effects and state observability. Plans in
POMDPs are expressed as policies, which specify the action to take given any possible probability
distribution over states. Unfortunately, exact planning algorithms such as value iteration [97]
are computationally intractable for most realistic POMDP planning problems. Furthermore,
researchers have had only limited success learning POMDP models from data. There are arguably
two primary reasons for these problems [77]. The first is the “curse of dimensionality”: for a
POMDP with n states, the optimal policy is a function of an (n − 1)-dimensional distribution over
latent state. The second is the “curse of history”: the number of distinct policies increases
exponentially in the planning horizon. We hope to mitigate the curse of dimensionality by seeking
an approximate dynamical system model with low dimensionality, and to mitigate the curse of
history by looking for a model that is susceptible to approximate planning.

As we demonstrated in Part I of this thesis, Predictive State Representations (PSRs) [61] and
the closely related Observable Operator Models (OOMs) [44] are generalizations of POMDPs
that have attracted interest because they both have greater representational capacity than POMDPs
and yield representations that are at least as compact [31, 93]. An additional benefit of PSRs and
OOMs is that many successful approximate planning techniques for POMDPs can be used to
plan in PSRs and OOMs with minimal adjustment. Accordingly, PSR and OOM models of
dynamical systems have potential to overcome both the curse of dimensionality and the curse of
history.

The quality of an optimized policy for a POMDP, PSR, or OOM depends strongly on the
accuracy of the model: inaccurate models typically lead to useless plans. We can specify a
model manually or learn one from data. A fully autonomous agent must be able to learn models
from data, but due to the difficulty of learning, it is far more common to see planning
algorithms applied to hand-specified models, and therefore to small systems where there is extensive
and goal-relevant domain knowledge. For example, recent extensions of approximate planning
techniques for PSRs have only been applied to hand-constructed models [43, 45].

Work  that  does  learn  models  for  planning  in  partially  observable  environments  has  so  far  met 
with  only  limited  success.  As  a  result,  there  have  been  few  successful  attempts  at  closing  the 
loop  by  learning  a  model  from  an  environment,  planning  in  that  model,  and  testing  the  plan  in 
the environment. For example, Expectation-Maximization (EM; see, e.g., [9]) does not avoid
local minima or scale to large state spaces; and, although many learning algorithms have been
proposed for PSRs [16, 67, 92, 119, 123] and OOMs [38, 44, 63], none have been shown to learn
models  that  are  accurate  enough  for  planning. 

Several researchers have, however, made progress in the problem of planning using a learned
model. In one instance [87], researchers obtained a POMDP heuristically from the output of
a model-free algorithm [66] and demonstrated planning on a small toy maze. In another
instance [85], researchers used Markov Chain Monte Carlo (MCMC) inference both to learn a
factored Dynamic Bayesian Network (DBN) representation of a POMDP in a small synthetic
network administration domain and to perform online planning. Due to the cost of the
MCMC sampler used, this approach is still impractical for larger models. In a third example,
researchers learned Linear-Linear Exponential Family PSRs from an agent traversing a simulated
environment, and found a policy using a policy gradient technique with a parameterized
function of the learned PSR state as input [120, 122]. In this case both the learning and the
planning algorithm were subject to local optima. In addition, the authors determined that the learned
model was too inaccurate to support value-function-based planning methods [120]. Finally, there
is a successful line of research which computes closed-loop controllers from learned or
partly-learned models, starting from linear subspace identification [117] and ranging to controllers for
helicopters [72] and bird-like robots [107]. This line of research uses techniques similar to the
ones described here, but it focuses on control-like problems, in which accurate state estimation
and dealing with continuous controls are the main sources of difficulty, in contrast to the
planning-like problems we consider here, in which longer-term lookahead and discrete choices
are more important.

The current work differs from these and other previous examples of planning in learned
models: it both uses a principled and provably statistically consistent model-learning algorithm,
which we developed in Chapter 3 and Chapter 4, and demonstrates that this algorithm is able
to learn compact models of a difficult, realistic dynamical system without any prior domain
knowledge built into the model or algorithm. Finally, we perform approximate point-based value
iteration (PBVI) in the learned compact models, and demonstrate that the greedy policy for the
resulting value function works well in the original (not the learned) system. To our knowledge this
is the first research that combines all of these achievements, closing the loop from observations
to actions in an unknown nonlinear, non-Gaussian planning system with no human intervention
beyond collecting the raw transition data and specifying features.


8.2  Planning  in  PSRs 

The primary motivation for modeling a controlled dynamical system is to reason about the effects
of taking a sequence of actions in the system. This chapter is not predominantly about
planning algorithms for PSRs, and PSR planning is a straightforward extension of POMDP
planning [43, 45]; we nevertheless describe PSR planning here because it is needed for our
closing-the-loop experiments. The PSR model can be augmented for this purpose by specifying a
linear reward function for taking an action $a_t$ in state $b_t$:


    $\mathcal{R}(b_t, a_t) = \eta_{a_t}^\top b_t$    (8.1)

where $\eta_{a_t} \in \mathbb{R}^n$ is the linear function specified by action $a_t$. (This extension generalizes the
state-dependent rewards of (PO)MDPs.) Given this function and a discount factor $\gamma$, the planning
problem for PSRs is to find a policy that maximizes the expected discounted sum of rewards,
$\mathbb{E}\left[\sum_t \gamma^t \mathcal{R}(b_t, a_t)\right]$. The optimal policy can be compactly represented using the optimal value
function $V^*(b_t)$, which specifies the expected sum of future rewards in each PSR state. The value
function is defined recursively as:

    $V^*(b_t) \overset{\text{def}}{=} \max_{a \in \mathcal{A}} \left[ \mathcal{R}(b_t, a_t = a) + \gamma \sum_{o \in \mathcal{O}} \mathbb{P}[o_t = o \mid b_t, a_t = a]\, V^*(b_{tao}) \right]$    (8.2)


where $b_{tao}$ is the state obtained from $b_t$ after executing action a and observing o. We have
implicitly assumed that the expected reward is a linear function of the PSR state; we can ensure
that this assumption holds by including the reward as an observation when we learn the PSR
dynamics. (Or, if the reward is not directly observable, by including its expectation given all
observable information.) We can obtain the optimal action by taking the arg max instead of the
max in Equation 8.2:

    $\pi^*(b_t) \overset{\text{def}}{=} \arg\max_{a \in \mathcal{A}} \left[ \mathcal{R}(b_t, a_t = a) + \gamma \sum_{o \in \mathcal{O}} \mathbb{P}[o_t = o \mid b_t, a_t = a]\, V^*(b_{tao}) \right]$    (8.3)
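
The one-step lookahead behind the arg max can be sketched in a few lines. This is an illustration under assumptions that are not stated in this section: it uses the standard linear PSR update, in which the probability of observation o after action a is $m_\infty^\top M_{ao} b_t$ and the next state is $M_{ao} b_t$ divided by that probability; the names `M`, `m_inf` and `eta`, the callable `V` standing in for $V^*$, and the toy numbers are all hypothetical.

```python
def mat_vec(M, v):
    """Matrix-vector product for lists of lists."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def greedy_action(b, M, m_inf, eta, gamma, V):
    """Return (best action index, its lookahead value) at PSR state b.

    M[a][o] is the operator for action a and observation o, eta[a] is the
    reward vector of action a, and V is a callable estimate of the value function.
    """
    best_a, best_q = None, float("-inf")
    for a in range(len(M)):
        q = dot(eta[a], b)                        # immediate reward eta_a^T b
        for M_ao in M[a]:
            unnorm = mat_vec(M_ao, b)             # M_ao b
            p = dot(m_inf, unnorm)                # probability of this observation
            if p > 0.0:
                b_next = [x / p for x in unnorm]  # updated state after (a, o)
                q += gamma * p * V(b_next)
        if q > best_q:
            best_a, best_q = a, q
    return best_a, best_q

# Toy 1-D model: two actions, two equally likely observations per action.
M = [[[[0.5]], [[0.5]]], [[[0.5]], [[0.5]]]]
eta = [[1.0], [2.0]]
action, q_value = greedy_action([1.0], M, [1.0], eta, 0.5, lambda b: 10.0 * b[0])
```

In this toy model the second action has the larger immediate reward and the same expected future value, so it is selected with a lookahead value of 7.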


When optimized exactly, the value function is always piecewise linear and convex (PWLC) in
the state, and has finitely many pieces in finite-horizon planning problems [45]. In this case, the
value function for a finite horizon t can be expressed by a set of vectors $\Gamma_t = \{\alpha_1, \ldots, \alpha_k\}$. Each
α-vector represents a hyperplane, and defines a value function over a bounded region of the PSR
state space:

    $V_t(b_t) = \max_{\alpha \in \Gamma_t} \alpha^\top b_t$    (8.4)

The finite horizon t set $\Gamma_t$ can be generated recursively from $\Gamma_{t-1}$ through a process called Exact
Value Iteration. Let

    $\Gamma^{a,o} = \{\alpha_i^{a,o} \mid \alpha_i \in \Gamma_{t-1},\ a \in \mathcal{A},\ o \in \mathcal{O}\}$    (8.5)

    $\alpha_i^{a,o} \overset{\text{def}}{=} \gamma M_{ao}^\top \alpha_i$    (8.6)

Then, for each action $a \in \mathcal{A}$ the set $\Gamma_t^a$ is generated by:

    $\Gamma_t^a = \mathcal{R}(b_t, a_t) + \bigoplus_{o \in \mathcal{O}} \Gamma^{a,o}$    (8.7)

where $\bigoplus$ denotes the cross-sum operator. Finally, the new set $\Gamma_t$ is the union

    $\Gamma_t = \bigcup_{a \in \mathcal{A}} \Gamma_t^a$    (8.8)

Exact value iteration for PSRs optimizes the value function over all possible beliefs or state
vectors. However, computing the exact value function is problematic because the number of
sequences of actions that must be considered grows exponentially with the planning horizon,
a problem known as the “curse of history.” Approximate point-based planning techniques (see below)
specifically target the curse of history by attempting only to calculate the best sequence of actions at
some finite set of belief points. Unfortunately, in high dimensions, approximate planning
techniques have difficulty adequately sampling the space of possible beliefs. This is called the “curse
of dimensionality.” Because PSRs often admit a compact low-dimensional representation, they
can reduce the effect of the curse of dimensionality, and so approximate point-based planning
techniques can work well in these models.

Point-Based Value Iteration (PBVI) [76] is an efficient approximation of exact value iteration
that performs value backup steps on a finite set of heuristically-chosen belief points rather than
over the entire belief simplex. PBVI exploits the fact that the value function is PWLC. A
linear lower bound on the value function at one point b can be used as a lower bound at nearby
points; this insight allows the value function to be approximated with a finite set of α-vectors,
one for each chosen point. Although PBVI was designed for POMDPs, the approach has been
generalized to PSRs [43]. Formally, given some set of points $B = \{b_1, \ldots, b_k\}$ in the PSR state
space, we recursively compute the value function and linear lower bounds at only these points.
The approximate value function after t iterations can be represented by a set of α-vectors $\Gamma_t$ such
that each $\alpha_i$ is maximal for at least one prediction vector $b_i$.

As with exact value iteration, $\Gamma_t$ can be generated recursively from $\Gamma_{t-1}$:

    $\Gamma_t^a = \{\alpha_b^a \mid b \in B\}, \qquad \alpha_b^a = \mathcal{R}(b_t, a_t = a) + \sum_{o \in \mathcal{O}} \arg\max_{\alpha \in \Gamma^{a,o}} \alpha^\top b$    (8.9)

where $\Gamma^{a,o}$ is as given in Equation 8.5. Because we only need to generate one α-vector $\alpha_b^a$ per
PSR state b for each action $a \in \mathcal{A}$, we calculate summations for only these states and do not need
the cross-sum operation (Equation 8.7). Finally we find the best α-vector for each PSR state

    $\alpha_b = \arg\max_{\alpha \in \bigcup_a \Gamma_t^a} \alpha^\top b \qquad (\forall b \in B)$    (8.10)

Once $\Gamma_t$ has been calculated, the approximate value function at any PSR state b, including $b \notin B$,
can be found as before (Equation 8.4).
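
A single point-based backup can be sketched as below. This is a minimal illustration rather than the planner used in the experiments: it assumes that the projected vectors take the form $\gamma M_{ao}^\top \alpha$, where `M[a][o]` is a hypothetical name for the PSR operator of action a and observation o, that the reward term contributes the vector `eta[a]`, and, as a simplification, that the best vector for each point is chosen among the candidates generated for that same point. All names and the toy numbers are illustrative.

```python
def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def mat_T_vec(M, v):
    """Product M^T v for a matrix stored as a list of rows."""
    return [sum(M[i][j] * v[i] for i in range(len(M))) for j in range(len(M[0]))]

def value(b, Gamma):
    """Piecewise linear value: the maximum over alpha-vectors of alpha^T b."""
    return max(dot(alpha, b) for alpha in Gamma)

def pbvi_backup(B, Gamma_prev, M, eta, gamma):
    """One point-based backup: return one new alpha-vector per point in B."""
    Gamma_new = []
    for b in B:
        best_vec, best_val = None, float("-inf")
        for a in range(len(M)):
            alpha_ab = list(eta[a])  # reward vector of action a
            for M_ao in M[a]:
                # Project every previous alpha-vector through (a, o) and keep
                # the projection that scores highest at b.
                projected = [[gamma * x for x in mat_T_vec(M_ao, alpha)]
                             for alpha in Gamma_prev]
                best = max(projected, key=lambda c: dot(c, b))
                alpha_ab = [x + y for x, y in zip(alpha_ab, best)]
            val = dot(alpha_ab, b)
            if val > best_val:
                best_vec, best_val = alpha_ab, val
        Gamma_new.append(best_vec)
    return Gamma_new

# Toy 1-D model: two actions with rewards 1 and 2, one observation, discount 0.5.
M = [[[[1.0]]], [[[1.0]]]]
eta = [[1.0], [2.0]]
Gamma_1 = pbvi_backup([[1.0]], [[0.0]], M, eta, 0.5)
Gamma_2 = pbvi_backup([[1.0]], Gamma_1, M, eta, 0.5)
```

Repeated backups in this toy model approach the fixed point 2 / (1 - 0.5) = 4, and `value` evaluates the resulting approximation at states that are not in B.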

In addition to being tractable on much larger-scale planning problems than exact value iteration,
PBVI comes with theoretical guarantees in the form of error bounds that are low-order
polynomials in the degree of approximation, range of reward values, and discount factor $\gamma$ [43, 76].
In our experiments, we use Perseus [45, 102], a variant of PBVI that updates the value function
over a small randomized subset of a large set of reachable belief points at each time step. By
only updating a subset of belief points, Perseus can achieve a computational advantage over plain
PBVI in some domains.


8.3  Experimental  Results 

We  have  introduced  a  novel  algorithm  for  learning  PSRs  directly  from  data,  as  well  as  a  kernel- 
based  extension  for  modeling  continuous  observations.  We  judge  the  quality  of  our  PSR  learning 
algorithm  by  first  learning  a  model  of  a  challenging  non-linear,  partially  observable,  controlled 
domain  directly  from  sensor  inputs  and  then  “closing  the  loop”  by  planning  in  the  learned  model. 
Successful  planning  is  a  much  stronger  result  than  standard  dynamical  system  evaluations  such 
as  one-step  squared  error  or  prediction  log-likelihood.  Unlike  previous  attempts  to  learn  PSRs, 
which either lack planning results [84, 121], or which compare policies within the learned
system [122], we compare our resulting policy to a bound on the best possible solution in the original
system  and  demonstrate  that  the  policy  is  close  to  optimal. 

8.3.1  The  Autonomous  Robot  Domain 

Our  simulated  autonomous  robot  domain  consisted  of  a  simple  45  x  45  unit  square  arena  with 
a  central  obstacle  and  brightly  colored  walls  (Figure  8.1(A-B)),  containing  a  robot  of  radius  2 
units.  The  robot  could  move  around  the  floor  of  the  arena  and  rotate  to  face  in  any  direction. 
The  robot  had  a  simulated  16  x  16  pixel  color  camera,  with  a  focal  plane  one  unit  in  front 
of  the  robot’s  center  of  rotation,  and  with  a  visual  field  of  45°  in  both  azimuth  and  elevation, 
corresponding  to  an  angular  resolution  of  ~  2.8°  per  pixel.  Images  on  the  sensor  matrix  at 
any  moment  were  simulated  by  a  non-linear  perspective  transformation  and  projection  onto  the 
camera’s  focal  plane,  based  on  the  robot’s  current  position  and  orientation  in  the  environment. 
The  resulting  768-element  pattern  of  unprocessed  RGB  values  was  the  only  input  to  the  robot 
(images  were  not  preprocessed  to  extract  features),  and  each  action  produced  a  new  set  of  pixel 
values.  The  robot  was  able  to  move  forward  1  or  0  units,  and  simultaneously  rotate  15°,  -15°,
or  0°,  resulting  in  6  unique  actions.  In  the  real  world,  friction,  uneven  surfaces,  and  other  factors 
confound  precisely  predictable  movements.  To  simulate  this  uncertainty,  a  small  amount  of 
Gaussian  noise  was  added  to  the  translation  (mean  0,  standard  deviation  .1  units)  and  rotation 
(mean  0,  standard  deviation  5°)  components  of  the  actions.  The  robot  was  allowed  to  occupy 
any  real-valued  (x,  y,  θ)  pose  that  didn’t  intersect  a  wall;  in  case  of  an  attempt  to  drive  through  a
wall,  we  interrupted  the  commanded  motion  just  before  contact,  simulating  an  inelastic  collision. 
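
As a rough illustration of the motion model described above, the following Python sketch applies one noisy action to a pose. It is a simplified sketch under stated assumptions, not the simulator used in the experiments: the names `ACTIONS` and `step` are hypothetical, the rotation is assumed to be applied before the translation, and collisions are approximated by clamping the robot's center to the outer walls while ignoring the central obstacle.

```python
import math
import random

ARENA_SIZE = 45.0          # side length of the square arena, in units
ROBOT_RADIUS = 2.0         # robot radius, in units
CAMERA_PIXELS = 16         # pixels per side of the simulated camera
FIELD_OF_VIEW_DEG = 45.0   # visual field in azimuth and elevation

# Six actions: (forward distance in units, rotation in degrees).
ACTIONS = [(f, r) for f in (1.0, 0.0) for r in (15.0, -15.0, 0.0)]


def angular_resolution_deg():
    """Degrees of visual field covered by one pixel."""
    return FIELD_OF_VIEW_DEG / CAMERA_PIXELS


def observation_size():
    """Number of raw RGB values in one camera image."""
    return CAMERA_PIXELS * CAMERA_PIXELS * 3


def step(pose, action, rng, trans_sd=0.1, rot_sd_deg=5.0):
    """Apply one noisy action to an (x, y, theta) pose, theta in radians."""
    x, y, theta = pose
    forward, rotation_deg = action
    # Zero-mean Gaussian noise on both components of the action.
    distance = forward + rng.gauss(0.0, trans_sd)
    turn = math.radians(rotation_deg + rng.gauss(0.0, rot_sd_deg))
    theta = (theta + turn) % (2.0 * math.pi)  # assumption: rotate, then move
    x += distance * math.cos(theta)
    y += distance * math.sin(theta)
    # Simplified collision handling: keep the robot body inside the outer walls.
    low, high = ROBOT_RADIUS, ARENA_SIZE - ROBOT_RADIUS
    x = max(low, min(high, x))
    y = max(low, min(high, y))
    return (x, y, theta)
```

The constants reproduce the figures quoted in the text: 45° over 16 pixels gives about 2.8° per pixel, and 16 x 16 x 3 gives the 768 observation values.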

The  autonomous  robot  domain  was  designed  to  be  a  difficult  domain  comparable  to  the  most 
complex  domains  that  previous  PSR  algorithms  have  attempted  to  model.  In  particular,  the 
domain  in  this  paper  was  modeled  after  the  autonomous  robot  domains  found  in  recent  PSR
work  [121,  122].  The  proposed  problem,  learning  a  model  of  this  domain  and  then  planning 
in  the  learned  model,  is  quite  difficult.  The  autonomous  robot  has  no  knowledge  of  any  of  the 
underlying  properties  of  the  domain,  e.g.,  the  geometry  of  the  environment  or  the  robot  motion 
model;  it  only  has  access  to  samples  of  the  3  x  256  pixel  features,  and  how  these  features  change 
as  actions  are  executed.  Writing  a  correct  policy  for  a  specific  task  in  this  domain  by  hand 
would  be  at  best  tedious — and  in  any  case,  as  mentioned  above,  it  is  often  impractical  to  hand- 
design  a  policy  for  an  autonomous  agent,  since  doing  so  requires  guessing  the  particular  planning 
problems  that  the  agent  may  face  in  the  future.  Furthermore,  the  continuous  and  non-linear  nature 
of  this  domain  makes  learning  models  difficult.  For  example,  a  POMDP  model  of  this  domain 
would  require  a  prohibitively  large  number  of  hidden  states,  making  learning  and  planning  next 
to  impossible.  PSRs  are  able  to  overcome  this  problem  by  compactly  representing  state  in  a  low-dimensional
real-valued  space,  and  the  algorithm  presented  in  this  work  allows  us  to  efficiently
learn  the  parameters  of  the  PSR  in  closed  form. 


8.3.2  Features 

In  order  to  contend  with  the  continuous  observations  generated  by  our  experimental  system, 
we  used  features  of  actions  and  observations  when  building  our  observable  representation.  In 
particular,  we  built  features  for  each  observation  by  using  a  fraction  of  the  training  observations 
as  kernel  centers.  We  used  a  multivariate  Gaussian  kernel  with  an  elliptical  covariance  matrix, 
chosen  by  PCA:  that  is,  we  used  a  spherical  covariance  after  projecting  onto  the  eigenvectors  of 
the  covariance  matrix  of  the  observations  and  scaling  by  the  square  roots  of  the  eigenvalues.  We 
chose  the  bandwidth  manually,  by  a  coarse  search.  However,  as  we  demonstrated  in  Chapter  4,
the  exact  details  of  kernel  choice  are  not  an  essential  feature  of  our  algorithm:  any  smooth  kernel 
could  suffice  (and,  in  fact,  many  different  types  of  features  could  be  used). 

One  of  the  nice  properties  of  this  choice  of  features  is  that  the  kernel  density  estimate  (KDE)
of  the  observation  probability  density  function  (PDF)  is  a  convex  combination  of  these  kernels; 
since  each  kernel  integrates  to  1,  this  estimator  also  integrates  to  1.  KDE  theory  [91]  tells  us  that, 
with  the  correct  kernel  weights,  as  the  number  of  kernel  centers  and  the  number  of  samples  go  to 
infinity  and  the  kernel  bandwidth  goes  to  zero  (at  appropriate  rates),  the  KDE  estimator  converges 
to  the  observation  PDF  in  L1  norm.  The  kernel  density  estimator  is  completely  determined  by 
the  normalized  vector  of  kernel  weights;  therefore,  if  we  can  estimate  this  vector  accurately,  our 
estimate  of  the  observation  PDF  will  converge  to  the  observation  PDF  as  well.  Hence  our  goal  is 
to  predict  the  correct  expected  value  of  this  normalized  kernel  vector  given  all  past  observations. 
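
To make the normalized kernel vector concrete, here is a minimal Python sketch. It assumes the observations have already been projected and rescaled as described above, so that a spherical Gaussian kernel with a single bandwidth applies; the function names and the uniform fallback for numerical underflow are our own illustrative choices, not part of the method.

```python
import math


def gaussian_kernel(x, center, bandwidth):
    """Unnormalized spherical Gaussian kernel between two equal-length vectors."""
    sq_dist = sum((xi - ci) ** 2 for xi, ci in zip(x, center))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def normalized_kernel_features(x, centers, bandwidth):
    """Evaluate every kernel at x and normalize so the weights sum to one."""
    weights = [gaussian_kernel(x, c, bandwidth) for c in centers]
    total = sum(weights)
    if total == 0.0:
        # All kernels underflowed; fall back to uniform weights (an assumption).
        return [1.0 / len(centers)] * len(centers)
    return [w / total for w in weights]
```

Because the weights are nonnegative and sum to one, they define a convex combination of the kernels, which is the property the KDE argument above relies on.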

Given  these  features,  we  can  write  our  latent-state  update  in  the  continuous-observation  case 
in  the  form  of  Equation  4.2c,  using  a  matrix  Bao;  however,  rather  than  learning  each  of  the 
uncountably-many  Bao  matrices  separately,  we  learn  one  base  operator  per  kernel  center,  and 
use  convex  combinations  of  these  base  operators  to  compute  observable  operators  as  needed. 
For  details  on  practical  aspects  of  learning  with  continuous  observations  and  these  features,  see 
Boots  et  al.  [14]. 
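
The convex combination of base operators can be sketched as follows. This is only an illustration of the combination step, assuming one base operator per kernel center for a fixed action and a vector of normalized kernel weights for the current observation; the function name is hypothetical, and the full update (including any normalization) is the one given in Chapter 4.

```python
def combine_operators(base_operators, weights, tol=1e-9):
    """Return sum_i weights[i] * base_operators[i] for equally sized matrices.

    base_operators: list of matrices (lists of rows), one per kernel center.
    weights: normalized kernel weights of the current observation.
    """
    if len(base_operators) != len(weights):
        raise ValueError("need exactly one weight per base operator")
    if any(w < 0.0 for w in weights) or abs(sum(weights) - 1.0) > tol:
        raise ValueError("weights must be nonnegative and sum to one")
    rows = len(base_operators[0])
    cols = len(base_operators[0][0])
    combined = [[0.0] * cols for _ in range(rows)]
    for operator, weight in zip(base_operators, weights):
        for i in range(rows):
            for j in range(cols):
                combined[i][j] += weight * operator[i][j]
    return combined
```

With this scheme only as many operators as there are kernel centers need to be stored per action, rather than one per possible observation.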

8.3.3  Learning  a  Model

We  learn  our  model  from  a  sample  of  10000  short  trajectories,  each  containing  7  action-observation 
pairs.  We  generate  each  trajectory  by  starting  from  a  uniformly  randomly  sampled  position  in 
the  environment  and  executing  a  uniform  random  sequence  of  actions.  In  each  trajectory,  we 
consider  the  first  3  actions  the  “past,”  the  4th  action  the  “present,”  and  the  last  3  actions  the  “future.”
(The  initial  distribution  ω  is,  therefore,  the  distribution  obtained  by  initializing  uniformly
and  taking  3  random  actions.)  We  used  the  first  l  =  2000  trajectories  to  generate  kernel  centers,
and  the  remaining  w  =  8000  to  estimate  the  matrices  $\Sigma_{\mathcal{T},\mathcal{H}}$  and  $\Sigma_{\mathcal{T},ao,\mathcal{H}}$.
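
The bookkeeping in this paragraph can be sketched in a few lines of Python. The helper names are hypothetical, and a trajectory is assumed to be a plain list of 7 (action, observation) pairs.

```python
def split_trajectory(trajectory, n_past=3, n_future=3):
    """Split one trajectory into its past, present, and future segments."""
    if len(trajectory) != n_past + 1 + n_future:
        raise ValueError("unexpected trajectory length")
    past = trajectory[:n_past]
    present = trajectory[n_past]
    future = trajectory[n_past + 1:]
    return past, present, future


def split_dataset(trajectories, n_centers=2000):
    """Reserve the first n_centers trajectories for kernel centers."""
    return trajectories[:n_centers], trajectories[n_centers:]
```

For 10000 trajectories this leaves 8000 trajectories for estimating the matrices, matching the values of l and w above.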

To  define  these  matrices,  we  need  to  specify  a  set  of  indicative  features,  a  set  of  observation 
kernel  centers,  and  a  set  of  characteristic  features.  We  use  Gaussian  kernels  to  define  our  indicative
and  characteristic  features,  in  a  similar  manner  to  the  Gaussian  kernels  described  above  for
observations;  our  analysis  allows  us  to  use  arbitrary  indicative  and  characteristic  features,  but  we
found  Gaussian  kernels  to  be  convenient  and  effective.  Note  that  the  resulting  features  over  tests
and  histories  are  just  features;  unlike  the  kernel  centers  defined  over  observations,  there  is  no
need  to  let  the  kernel  width  approach  zero,  since  we  are  not  attempting  to  learn  accurate  PDFs
over  the  histories  and  tests  in  $\mathcal{H}$  and  $\mathcal{T}$.

In  more  detail,  we  define  a  set  of  2000  indicative  kernels,  each  one  centered  at  a  sequence  of
3  observations  from  the  initial  segment  of  one  of  our  trajectories.  We  choose  the  kernel  covariance
using  PCA  on  these  sequences  of  observations,  just  as  described  for  single  observations  in
Section  8.3.2.  We  then  generate  our  indicative  features  for  a  new  sequence  of  three  observations
by  evaluating  each  indicative  kernel  at  the  new  sequence,  and  normalizing  so  that  the  vector  of
features  sums  to  one.  Similarly,  we  define  2000  characteristic  kernels,  each  one  centered  at  a
sequence  of  3  observations  from  the  end  of  one  of  our  sample  trajectories,  choose  a  kernel  covariance,
and  define  our  characteristic  feature  vector  by  evaluating  each  kernel  at  a  new  observation
sequence  and  normalizing.  Finally,  we  define  500  observation  kernels,  each  one  centered  at  a
single  observation  from  the  middle  of  one  of  our  sample  trajectories,  and  replace  each  observation
by  its  corresponding  vector  of  normalized  kernel  weights.  Next,  we  construct  the  matrices
$\Sigma_{\mathcal{T},\mathcal{H}}$  and  $\Sigma_{\mathcal{T},ao,\mathcal{H}}$  as  the  empirical  expectations  over  our  8000  training  trajectories  according
to  the  equations  in  Section  4.2.  Finally,  we  chose  $|Q'|  =  5$  as  the  dimension  of  our  PSR,  the
smallest  dimension  that  was  able  to  produce  high  quality  policies  (see  Section  8.3.5  below).

8.3.4  Qualitative  Evaluation 

Having  learned  the  parameters  of  the  PSR  according  to  the  algorithm  in  Section  4.2,  we  can 
use  the  model  for  prediction,  filtering,  and  planning  in  the  autonomous  robot  domain.  We  first 
evaluated  the  model  qualitatively  by  projecting  the  sets  of  histories  in  the  training  data  onto  the 
learned  PSR  state  space:  $U^\top \phi^{\mathcal{H}}_{1:w}$.  We  colored  each  data  point  according  to  the  average  of  the
red,  green,  and  blue  components  of  the  highest  probability  observation  following  the  projected 
history.  The  features  of  the  low  dimensional  embedding  clearly  capture  the  topology  of  the 
major  features  of  the  robot’s  visual  environment  (Figure  8.1(C-D)),  and  continuous  paths  in  the 
environment  translate  into  continuous  paths  in  the  latent  space  (Figure  8.1(F)).  For  example, 
positions  near  different  walls  are  mapped  to  different  “spines”  in  the  star-shaped  embedding  of 
the  state  space  (Figure  8.1(C)). 
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8.3.5  Planning  in  the  Learned  Model 


To  test  the  quality  of  the  learned  model,  we  set  up  a  navigation  problem  where  the  robot  was 
required  to  plan  to  reach  a  goal  image  (looking  directly  at  the  blue  wall).  We  specified  a  large 
reward  (1000)  for  this  observation,  a  reward  of  -1  for  colliding  with  a  wall,  and  0  for  every  other
observation.  We  learned  a  reward  function  by  linear  regression  from  the  embedded  histories 
$U^\top \phi^{\mathcal{H}}_{1:w}$  to  the  observed  immediate  rewards.  We  used  the  learned  reward  function  to  compute  an
approximate  state-action  value  function  via  the  PSR  extension  of  the  Perseus  variant  of  PBVI  [43, 
45,  76,  102]  with  discount  factor  γ  =  0.8,  a  prediction  horizon  of  10  steps,  and  with  the  8000
embedded  histories  as  the  set  of  belief  points.  The  learned  value  function  is  displayed  in  Figure 
8.1(E).  When  faced  with  a  new  3-step  history,  we  computed  an  initial  belief  by  starting  with 
$b_*$  and  tracking  for  3  action-observation  pairs,  and  from  then  on  executed  the  greedy  policy  for
our  learned  value  function  while  updating  our  belief  with  each  new  observation.  Examples  of 
paths  planned  in  the  learned  model  are  presented  in  Figure  8.1(F);  the  same  paths  are  shown  in 
geometric  space  in  Figure  8.1(G).  (Recall  that  the  robot  only  has  access  to  images,  never  its  own 
position.) 
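
Executing the greedy policy amounts to picking, at each step, the action with the largest approximate state-action value at the current belief. The sketch below assumes, as is typical for point-based value iteration, that the value of each action is stored as a set of weight vectors and evaluated as the maximum inner product with the belief; this representation and the names used are illustrative assumptions rather than a description of our implementation.

```python
def q_value(belief, vectors):
    """Approximate value of one action: max over its vectors of <vector, belief>."""
    return max(sum(a * b for a, b in zip(alpha, belief)) for alpha in vectors)


def greedy_action(belief, vectors_by_action):
    """Return the action whose approximate value at this belief is largest."""
    return max(
        vectors_by_action,
        key=lambda action: q_value(belief, vectors_by_action[action]),
    )
```

After each action the belief is updated with the new observation, and `greedy_action` is called again on the updated belief.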

The  reward  function  encouraged  the  robot  to  navigate  to  a  specific  set  of  points  in  the  environment,
so  the  planning  problem  can  be  viewed  as  a  shortest  path  problem.  Even  though  we
don’t  encode  this  intuition  into  our  algorithm,  we  can  use  it  to  quantitatively  evaluate  the  performance
of  the  policy  in  the  original  system.  First  we  randomly  sampled  100  initial  histories  in  the
environment  and  asked  the  robot  to  plan  paths  for  each  based  on  its  learned  policy.  We  compared
the  number  of  actions  taken  both  to  a  random  policy  and  to  the  optimistic  path,  calculated  by
A*  search  in  the  robot’s  configuration  space  given  the  true  underlying  position.  Note  that  this 
comparison  is  somewhat  unfair:  in  order  to  achieve  the  same  cost  as  the  optimistic  path,  the  robot 
would  have  to  know  its  true  underlying  position,  the  dynamics  would  have  to  be  deterministic, 
all  rotations  would  have  to  be  instantaneous,  and  the  algorithm  would  need  an  unlimited  amount 
of  training  data.  Nonetheless,  the  results  (Figure  8.1(H))  indicate  that  the  performance  of  the 
PSR  policy  is  close  to  this  optimistic  bound.  We  think  that  this  finding  is  remarkable,  especially 
given  that  previous  approaches  have  encountered  significant  difficulty  modeling  continuous  domains
[47]  and  domains  with  similarly  high  levels  of  complexity  [122].


8.4  Conclusions 

In  Chapter  3  and  Chapter  4  we  presented  novel,  consistent  subspace  identification  algorithms
that  simultaneously  solve  the  discovery  and  learning  problems  for  PSRs  (and  therefore  POMDPs). 
In  this  chapter  we  showed  how  point-based  approximate  planning  techniques  can  be  used  to  solve 
the  planning  problem  in  a  learned  model.  We  demonstrated  the  representational  capacity  of  our 
model  and  the  effectiveness  of  our  learning  algorithm  by  learning  a  compact  model  from  simulated
autonomous  robot  vision  data.  Finally,  we  closed  the  loop  by  successfully  planning  with
the  learned  models.  To  our  knowledge  this  is  the  first  instance  of  learning  a  model  and  successfully
planning  in  the  learned  model  for  a  simulated  robot  in  a  nonlinear,  non-Gaussian,  partially
observable  environment  of  this  complexity  using  a  consistent  algorithm.  We  compare  the  policy 
generated  by  our  model  to  a  bound  on  the  best  possible  value,  and  determine  that  our  policy  is 
close  to  optimal.

We  believe  that  spectral  PSR  learning  algorithms  can  increase  the  scope  of  planning  under 
uncertainty  for  autonomous  agents  in  previously  intractable  scenarios.  We  believe  that  this  improvement
is  partly  due  to  the  greater  representational  power  of  PSRs  as  compared  to  POMDPs,
and  partly  due  to  the  efficient  and  statistically  consistent  nature  of  the  learning  method. 

Figure  8.1:  Learning  and  Planning  in  the  Autonomous  Robot  Domain.  (A)  The  robot  uses  visual
sensing  to  traverse  a  square  domain  with  multi-colored  walls  and  a  central  obstacle.  Examples  of 
images  recorded  by  the  robot  occupying  two  different  positions  in  the  environment  are  shown  at 
the  bottom  of  the  figure.  (B)  A  to-scale  3-dimensional  view  of  the  environment.  (C)  The  2nd  and 
3rd  dimension  of  the  learned  subspace  (the  first  dimension  primarily  contained  normalization 
information).  Each  point  is  the  embedding  of  a  single  history,  displayed  with  color  equal  to  the 
average  RGB  color  in  the  first  image  in  the  highest  probability  test.  (D)  The  same  points  in  (C) 
projected  onto  the  environment’s  geometric  space.  (E)  The  value  function  computed  for  each 
embedded  point;  lighter  indicates  higher  value.  (F)  Policies  executed  in  the  learned  subspace. 
The  red,  green,  magenta,  and  yellow  paths  correspond  to  the  policy  executed  by  a  robot  with 
starting  positions  facing  the  red,  green,  magenta,  and  yellow  walls  respectively.  (G)  The  paths 
taken  by  the  robot  in  geometric  space  while  executing  the  policy.  Each  of  the  paths  corresponds 
to  the  path  of  the  same  color  in  (F).  The  darker  circles  indicate  the  starting  and  ending  positions, 
and  the  tick-mark  indicates  the  robot’s  orientation.  (H)  Analysis  of  planning  from  100  randomly 
sampled  start  positions  to  the  target  image  (facing  blue  wall).  In  the  bar  graph:  the  mean  number 
of  actions  taken  by  the  optimistic  solution  found  by  A*  search  in  configuration  space  (left);  the 
mean  number  taken  by  the  policy  found  by  Perseus  in  the  learned  model  (center);  and  the  mean 
number  taken  by  a  random  policy  (right).  The  line  graph  illustrates  the  cumulative  density  of  the 
number  of  actions  given  the  optimal,  learned,  and  random  policies. 

Chapter  9

Reinforcement  Learning:  Predictive  State 
Temporal  Difference  Learning 

9.1  Introduction 

In  this  chapter  we  will  show  how  to  use  spectral  learning  algorithms  to  enhance  least  squares  temporal
difference  learning.  Specifically,  we  examine  the  problem  of  estimating  a  policy’s  value
function  within  a  decision  process  in  a  high  dimensional  and  partially-observable  environment,
when  the  parameters  of  the  process  are  unknown:  i.e.,  we  only  have  access  to  trajectories  of
observations  and  rewards  sampled  from  the  process.  In  this  situation,  a  common  strategy  is  to 
employ  a  linear  architecture  and  represent  the  value  function  as  a  linear  combination  of  features 
of  (sequences  of)  observations.  A  popular  family  of  model-free  algorithms  called  temporal  difference
(TD)  algorithms  [105]  can  then  be  used  to  estimate  the  parameters  of  the  value  function.
Least-squares  TD  (LSTD)  algorithms  [17,  19,  59]  exploit  the  linearity  of  the  value  function  to 
find  the  optimal  parameters  in  a  least-squares  sense  from  time-adjacent  samples  of  features. 

Unfortunately,  choosing  a  good  set  of  features  is  hard.  The  features  must  be  predictive  of 
future  reward,  and  the  set  of  features  must  be  small  relative  to  the  amount  of  training  data,  or  TD 
learning  will  be  prone  to  overfitting.  The  problem  of  selecting  a  small  set  of  reasonable  features 
has  been  approached  from  a  number  of  different  perspectives.  In  many  domains,  the  features 
are  selected  by  hand  according  to  expert  knowledge;  however,  this  task  can  be  difficult  and  time 
consuming  in  practice.  Therefore,  a  considerable  amount  of  research  has  been  devoted  to  the 
problem  of  automatically  identifying  features  that  support  value  function  approximation. 

Much  of  this  research  is  devoted  to  finding  sets  of  features  when  the  dynamical  system  is 
known,  but  the  state  space  is  large  and  difficult  to  work  with.  For  example,  in  a  large  fully 
observable  Markov  decision  process  (MDP),  it  is  often  easier  to  estimate  the  value  function  from 
a  low  dimensional  set  of  features  than  by  using  state  directly.  So,  several  approaches  attempt  to 
automatically  discover  a  small  set  of  features  from  a  given  larger  description  of  an  MDP,  e.g.,  by 
using  a  spectral  analysis  of  the  state-space  transition  graph  to  discover  a  low-dimensional  feature 
set  that  preserves  the  graph  structure  [46,  64,  65]. 

Partially  observable  Markov  decision  processes  (POMDPs)  extend  MDPs  to  situations  where 
the  state  is  not  directly  observable  [3,  22,  97].  In  this  circumstance,  an  agent  can  plan  using  a 
continuous  belief  state  with  dimensionality  equal  to  the  number  of  hidden  states  in  the  POMDP.
When  the  number  of  hidden  states  is  large,  dimensionality  reduction  in  POMDPs  can  be  achieved
by  projecting  a  high  dimensional  belief  space  to  a  lower  dimensional  one;  of  course,  the  difficulty
is  to  find  a  projection  which  preserves  decision  quality.  Strategies  for  finding  good  projections
include  value-directed  compression  [78]  and  non-negative  matrix  factorization  [60,  110].
The  resulting  model  after  compression  is  a  Predictive  State  Representation  (PSR)  [61,  93],  an 
Observable  Operator  Model  [44],  or  a  multiplicity  automaton  [30].  Moving  to  one  of  these  representations
can  often  compress  a  POMDP  by  a  large  factor  with  little  or  no  loss  in  accuracy:
examples  exist  with  arbitrarily  large  lossless  compression  factors,  and  in  practice,  we  can  often 
achieve  large  compression  ratios  with  little  loss. 

The  drawback  of  all  of  the  approaches  enumerated  above  is  that  they  first  assume  that  the  dynamical
system  model  is  known,  and  only  then  give  us  a  way  of  finding  a  compact  representation
and  a  value  function.  In  practice,  we  would  like  to  be  able  to  find  a  good  set  of  features,  without 
prior  knowledge  of  the  system  model.  Kolter  and  Ng  [55]  contend  with  this  problem  from  a  sparse 
feature  selection  standpoint.  Given  a  large  set  of  possibly-relevant  features  of  observations,  they 
proposed  augmenting  LSTD  by  applying  an  $L_1$  penalty  to  the  coefficients,  forcing  LSTD  to  select
a  sparse  set  of  features  for  value  function  estimation.  The  resulting  algorithm,  LARS-TD,
works  well  in  certain  situations  (for  example,  see  Section  9.4.1),  but  only  if  our  original  large  set 
of  features  contains  a  small  subset  of  highly-relevant  features. 

Recently,  Parr  et  al.  looked  at  the  problem  of  value  function  estimation  from  the  perspective 
of  both  model-free  and  model-based  reinforcement  learning  [75].  The  model-free  approach  estimates
a  value  function  directly  from  sample  trajectories,  i.e.,  from  sequences  of  feature  vectors
of  visited  states.  The  model-based  approach,  by  contrast,  first  learns  a  model  and  then  computes
the  value  function  from  the  learned  model.  Parr  et  al.  compared  LSTD  (a  model-free  method)  to
a  model-based  method  in  which  we  first  learn  a  linear  model  by  viewing  features  as  a  proxy  for
state  (leading  to  a  linear  transition  matrix  that  predicts  future  features  from  past  features),  and 
then  compute  a  value  function  from  this  approximate  model.  Parr  et  al.  demonstrated  that  these 
two  approaches  compute  exactly  the  same  value  function  [75],  formalizing  a  fact  that  has  been 
recognized  to  some  degree  before  [17]. 

In  this  chapter,  we  build  on  this  insight,  while  simultaneously  finding  a  compact  set  of  features
using  spectral  system  identification  ideas  from  Chapter  3  and  Chapter  4.  First,  we  look
at  the  problem  of  improving  LSTD  from  a  model-free  predictive-bottleneck  perspective:  given 
a  large  set  of  features  of  history,  we  devise  a  new  TD  method  called  Predictive  State  Temporal 
Difference  (PSTD)  learning  that  estimates  the  value  function  through  a  bottleneck  that  preserves 
only  predictive  information  (Section  9.3.2).  Intuitively,  this  approach  has  some  of  the  same  benefits
as  LARS-TD:  by  finding  a  small  set  of  predictive  features,  we  avoid  overfitting  and  make
learning  more  data-efficient.  However,  our  method  differs  in  that  we  identify  a  small  subspace 
of  features  instead  of  a  sparse  subset  of  features.  Hence,  PSTD  and  LARS-TD  are  applicable  in 
different  situations:  as  we  show  in  our  experiments  below,  PSTD  is  better  when  we  have  many 
marginally-relevant  features,  while  LARS-TD  is  better  when  we  have  a  few  highly-relevant  features
hidden  among  many  irrelevant  ones.

Second,  we  look  at  the  problem  of  value  function  estimation  from  a  model-based  perspective 
(Section  9.3.4).  Instead  of  learning  a  linear  transition  model  from  features,  as  in  [75],  we  use 
subspace  identification  [14,  84]  to  learn  a  PSR  from  our  samples.  Then  we  compute  a  value 
function  via  the  Bellman  equations  for  our  learned  PSR.  This  new  approach  has  a  substantial
benefit:  while  the  linear  feature-to-feature  transition  model  of  [75]  does  not  seem  to  have  any 
common  uses  outside  that  paper,  PSRs  have  been  proposed  numerous  times  on  their  own  merits 
(including  being  invented  independently  at  least  three  times),  and  are  a  strict  generalization  of 
POMDPs. 

Just  as  Parr  et  al.  showed  for  the  two  simpler  methods,  we  show  that  our  two  improved  methods
(model-free  and  model-based)  are  equivalent.  This  result  yields  some  appealing  theoretical
benefits:  for  example,  coefficients  of  PSTD  features  can  be  explicitly  interpreted  as  a  statistically
consistent  estimate  of  the  true  underlying  system  state.  And,  the  feasibility  of  finding  the
true  value  function  can  be  shown  to  depend  on  the  linear  dimension  of  the  dynamical  system, 
or  equivalently,  the  dimensionality  of  the  predictive  state  representation — not  on  the  cardinality 
of  the  POMDP  state  space.  Therefore  our  representation  is  naturally  “compressed”  in  the  sense 
of  [78],  speeding  up  convergence. 

The  improved  methods  also  yield  practical  benefits;  we  demonstrate  these  benefits  with  several
experiments.  First,  we  compare  PSTD  to  LSTD  and  LARS-TD  on  a  synthetic  example  using
different  sets  of  features  to  illustrate  the  strengths  and  weaknesses  of  each  algorithm.  Next,  we
apply  PSTD  to  a  difficult  optimal  stopping  problem  for  pricing  high-dimensional  financial  derivatives.
A  significant  amount  of  work  has  gone  into  hand  tuning  features  for  this  problem.  We  show
that,  if  we  add  a  large  number  of  weakly  relevant  features  to  these  hand-tuned  features,  PSTD  can 
find  a  predictive  subspace  which  performs  much  better  than  competing  approaches,  improving 
on  the  best  previously  reported  result  for  this  particular  problem  by  a  substantial  margin. 

The  theoretical  and  empirical  results  reported  here  suggest  that,  for  many  applications  where 
LSTD  is  used  to  compute  a  value  function,  PSTD  can  be  simply  substituted  to  produce  better 
results. 


9.2  Value  Function  Approximation 

The  notion  of  a  value  function  is  of  central  importance  in  reinforcement  learning:  for  a  given 
policy  π,  the  value  of  state  s  is  defined  as  the  expected  discounted  sum  of  rewards  obtained  when
starting  in  state  s  and  following  policy  π,  $J^\pi(s) = E\left[\sum_{t=0}^{\infty} \gamma^t \mathcal{R}(s_t) \mid s_0 = s, \pi\right]$.  It  is  well  known
that  the  value  function  must  obey  the  Bellman  equation

$$J^\pi(s) = \mathcal{R}(s) + \gamma \sum_{s'} J^\pi(s') \, P[s' \mid s, \pi(s)] \qquad (9.1)$$

If  we  know  the  transition  function  T,  and  if  the  set  of  states  S  is  sufficiently  small,  we  can 
use  (9.1)  directly  to  solve  for  the  value  function  $J^\pi$.  We  can  then  execute  the  greedy  policy  for
$J^\pi$,  setting  the  action  at  each  state  to  maximize  the  right-hand  side  of  (9.1).
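
For a small, fully known process, solving (9.1) for a fixed policy can be sketched as a fixed-point iteration. The code below is a minimal illustration, assuming the policy has already been folded into a reward vector and a row-stochastic transition matrix; the function name and stopping tolerance are our own choices.

```python
def evaluate_policy(rewards, transitions, gamma, tol=1e-12, max_iters=100000):
    """Iterate the Bellman equation (9.1) until the values stop changing.

    rewards[s]: immediate reward in state s.
    transitions[s][s2]: probability of moving from s to s2 under the policy.
    """
    n = len(rewards)
    values = [0.0] * n
    for _ in range(max_iters):
        updated = [
            rewards[s]
            + gamma * sum(p * values[s2] for s2, p in enumerate(transitions[s]))
            for s in range(n)
        ]
        if max(abs(u - v) for u, v in zip(updated, values)) < tol:
            return updated
        values = updated
    return values
```

As a check, consider a two-state cycle with rewards 1 and 0 and discount 0.5. The equations J(0) = 1 + 0.5 J(1) and J(1) = 0.5 J(0) give J(0) = 4/3 and J(1) = 2/3, which is what the iteration converges to.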

However,  we  consider  instead  the  harder  problem  of  estimating  the  value  function  when  s 
is  a  partially  observable  latent  variable,  and  when  the  transition  function  T  is  unknown.  In  this 
situation,  we  receive  information  about  s  through  observations  from  a  finite  set  O.  Our  state  (i.e., 
the  information  which  we  can  use  to  make  decisions)  is  not  an  element  of  S  but  a  history  (an 
ordered  sequence  of  action-observation  pairs  $h_t = a_0, o_0, \ldots, a_{t-1}, o_{t-1}$  that  have  been  executed
and  observed  prior  to  time  t).  If  we  knew  the  transition  model  T,  we  could  use  $h_t$  to  infer  a
belief  distribution  over  S,  and  use  that  belief  (or  a  compression  of  that  belief)  as  a  state  instead;
below,  we  will  discuss  how  to  learn  a  compressed  belief  state.  Because  of  partial  observability,
we  can  only  hope  to  predict  expected  reward  conditioned  on  history,  $\mathcal{R}(h_t) = E[\mathcal{R}(s) \mid h_t]$,  and
we  must  choose  actions  as  a  function  of  history,  $\pi(h_t)$  instead  of  $\pi(s)$.

Let  $\mathcal{H}$  be  the  set  of  all  possible  histories.  $\mathcal{H}$  is  often  very  large  or  infinite,  so  instead  of
finding  a  value  separately  for  each  history,  we  focus  on  value  functions  that  are  linear  in  features
of  histories

$$J^\pi(s) \approx w^\top \phi^{\mathcal{H}}(h_t) \qquad (9.2)$$

Here  $w \in \mathbb{R}^j$  is  a  parameter  vector  and  $\phi^{\mathcal{H}}(h_t) \in \mathbb{R}^j$  is  a  feature  vector  for  a  history  $h_t$.  So,  we
can  rewrite  the  Bellman  equation  as

$$w^\top \phi^{\mathcal{H}}(h_t) \approx \mathcal{R}(h_t) + \gamma \sum_{o \in O} w^\top \phi^{\mathcal{H}}(h_t \pi o) \, P[h_t \pi o \mid h_t \pi] \qquad (9.3)$$

where  $h_t \pi o$  is  history  $h_t$  extended  by  taking  action  $\pi(h_t)$  and  observing  o.


9.2.1  Least  Squares  Temporal  Difference  Learning 


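In  general  we  don’t  know  the  transition  probabilities  $P[h_t \pi o \mid h_t]$,  but  we  do  have  samples
of  state  features  $\phi^{\mathcal{H}}_t = \phi^{\mathcal{H}}(h_t)$,  next-state  features  $\phi^{\mathcal{H}}_{t+1} = \phi^{\mathcal{H}}(h_{t+1})$,  and  immediate  rewards
$\mathcal{R}_t = \mathcal{R}(h_t)$.  We  can  thus  estimate  the  Bellman  equation

$$w^\top \phi^{\mathcal{H}}_{1:k} \approx \mathcal{R}_{1:k} + \gamma w^\top \phi^{\mathcal{H}}_{2:k+1} \qquad (9.4)$$

(Here  we  have  used  the  notation  $\phi^{\mathcal{H}}_{1:k}$  to  mean  the  matrix  whose  columns  are  $\phi^{\mathcal{H}}_t$  for  $t = 1 \ldots k$.)
We  can  immediately  attempt  to  estimate  the  parameter  w  by  solving  the  linear  system  in  the
least  squares  sense:  $\hat{w}^\top = \mathcal{R}_{1:k} \left(\phi^{\mathcal{H}}_{1:k} - \gamma \phi^{\mathcal{H}}_{2:k+1}\right)^\dagger$,  where  $\dagger$  indicates  the  Moore-Penrose  pseudo-inverse.
However,  this  solution  is  biased  [19],  since  the  independent  variables  $\phi^{\mathcal{H}}_t - \gamma \phi^{\mathcal{H}}_{t+1}$  are
noisy  samples  of  the  expected  difference  $E[\phi^{\mathcal{H}}(h_t) - \gamma \sum_{o \in O} \phi^{\mathcal{H}}(h_t \pi o) \, P[h_t \pi o \mid h_t]]$.  In  other
words,  estimating  the  value  function  parameters  w  is  an  error-in-variables  problem.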

The  least  squares  temporal  difference  (LSTD)  algorithm  provides  a  consistent  estimate  of  the
independent  variables  by  right  multiplying  the  approximate  Bellman  equation  (Equation  9.4)  by
$\phi^{\mathcal{H}\top}_{1:k}$.  The  quantity  $\phi^{\mathcal{H}\top}_{1:k}$  can  be  viewed  as  an  instrumental  variable  [19],  i.e.,  a  measurement  that
is  correlated  with  the  true  independent  variables,  but  uncorrelated  with  the  noise  in  our  estimates
of  these  variables.¹  The  value  function  parameter  w  may  then  be  estimated  as  follows:

$$\hat{w}^\top = \frac{1}{k} \mathcal{R}_{1:k} \phi^{\mathcal{H}\top}_{1:k} \left( \frac{1}{k} \phi^{\mathcal{H}}_{1:k} \phi^{\mathcal{H}\top}_{1:k} - \frac{\gamma}{k} \phi^{\mathcal{H}}_{2:k+1} \phi^{\mathcal{H}\top}_{1:k} \right)^{-1} \qquad (9.5)$$

As  the  amount  of  data  k  increases,  the  empirical  covariance  matrices  $\phi^{\mathcal{H}}_{1:k} \phi^{\mathcal{H}\top}_{1:k} / k$  and  $\phi^{\mathcal{H}}_{2:k+1} \phi^{\mathcal{H}\top}_{1:k} / k$
converge  with  probability  1  to  their  population  values,  and  so  our  estimate  of  the  matrix  to  be
inverted  in  (9.5)  is  consistent.  Therefore,  as  long  as  this  matrix  is  nonsingular,  our  estimate  of
the  inverse  is  also  consistent,  and  our  estimate  of  w  therefore  converges  to  the  true  parameters
with  probability  1.
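
The LSTD estimate can be sketched in plain Python for small feature vectors. The code accumulates the two empirical averages that appear in the estimate and solves the resulting linear system in its transposed form, A w = b with A the average of φ_t (φ_t - γ φ_{t+1})ᵀ and b the average of φ_t R_t, which yields the same w. The helper names are hypothetical, and the Gaussian-elimination solver merely stands in for whatever linear algebra routine one would use in practice.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular to working precision")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def lstd(features, rewards, gamma):
    """Estimate w from k+1 consecutive feature vectors and k rewards."""
    k = len(rewards)
    d = len(features[0])
    a = [[0.0] * d for _ in range(d)]
    b = [0.0] * d
    for t in range(k):
        phi, phi_next = features[t], features[t + 1]
        for i in range(d):
            b[i] += phi[i] * rewards[t] / k
            for j in range(d):
                a[i][j] += phi[i] * (phi[j] - gamma * phi_next[j]) / k
    return solve_linear_system(a, b)
```

With one-hot features on a deterministic two-state cycle (rewards 1 and 0, discount 0.5), the estimate recovers the exact values 4/3 and 2/3, since the features can represent the value function exactly and there is no sampling noise.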


¹The  LSTD  algorithm  can  also  be  theoretically  justified  as  the  result  of  an  application  of  the  Bellman  operator
followed  by  an  orthogonal  projection  back  onto  the  row  space  of  $\phi^{\mathcal{H}}$  [59].

9.3  Predictive  Features


Although  LSTD  provides  a  consistent  estimate  of  the  value  function  parameters  w,  in  practice, 
the  potential  size  of  the  feature  vectors  can  be  a  problem.  If  the  number  of  features  is  large 
relative  to  the  number  of  training  samples,  then  the  estimation  of  w  is  prone  to  overfitting.  This 
problem  can  be  alleviated  by  choosing  some  small  set  of  features  that  only  contain  information 
that  is  relevant  for  value  function  approximation.  However,  with  the  exception  of  LARS-TD  [55], 
there has been little work on the problem of how to select features automatically for value function
approximation when the system model is unknown; and of course, manual feature selection
depends  on  not-always-available  expert  guidance. 

We  approach  the  problem  of  finding  a  good  set  of  features  from  a  bottleneck  perspective. 
That  is,  given  some  signal  from  history,  in  this  case  a  large  set  of  features,  we  would  like  to 
find a compression that preserves only relevant information for predicting the value function
$J^\pi$. This idea, finding a set of predictive features of the future, is directly related to spectral
identification of PSRs. In particular, we think of predictions of the future as tests (Chapter 3) or,
more generally, characteristic features (Chapter 4). Here we will use the terms “characteristic
features” and “features of the future” interchangeably.

9.3.1  Finding  Predictive  Features  Through  a  Bottleneck 

In order to find a predictive feature compression, we first need to determine what we would like
to predict, i.e., what characteristic features to choose. Since we are interested in value function
approximation,  the  most  relevant  prediction  is  the  value  function  itself;  so,  we  could  simply  try 
to  predict  total  future  discounted  reward  given  a  history.  Unfortunately,  total  discounted  reward 
has  high  variance,  so  unless  we  have  a  lot  of  data,  learning  will  be  difficult. 

We  can  reduce  variance  by  including  other  prediction  tasks  as  well.  For  example,  predicting 
individual  rewards  at  future  time  steps,  while  not  strictly  necessary  to  predict  total  discounted 
reward,  seems  highly  relevant,  and  gives  us  much  more  immediate  feedback.  Similarly,  future 
observations  hopefully  contain  information  about  future  reward,  so  trying  to  predict  observations 
can  help  us  predict  reward  better.  Finally,  in  any  specific  RL  application,  we  may  be  able  to  add 
problem-specific  prediction  tasks  that  will  help  focus  our  attention  on  relevant  information:  for 
example,  in  a  path-planning  problem,  we  might  try  to  predict  which  of  several  goal  states  we  will 
reach  (in  addition  to  how  much  it  will  cost  to  get  there). 

We  can  represent  all  of  these  prediction  tasks  as  features  of  the  future:  e.g.,  to  predict  which 
goal  we  will  reach,  we  add  a  distinct  observation  at  each  goal  state,  or  to  predict  individual 
rewards, we add individual rewards as observations.2 We will write $\phi^{\mathcal T}_t$ for the vector of all
features of the “future at time $t$,” i.e., events starting at time $t + 1$ and continuing forward.

So, instead of remembering a large arbitrary set of features of history, we want to find a small
subspace of features of history that is relevant for predicting features of the future. We will call
this subspace a predictive compression, and we will write the value function as a linear function
of only the predictive compression of features.

2 If we don’t wish to reveal extra information by adding additional observations, we can instead add the corresponding
feature predictions as observations; these predictions, by definition, reveal no additional information. To
save the trouble of computing these predictions, we can use realized feature values rather than predictions in our
learning algorithms below, at the cost of some extra variance: the expectation of the realized feature value is the
same as the expectation of the predicted feature value.

To find our predictive compression, we will use reduced-rank regression [83]. We define the
following empirical covariance matrices between features of the future and features of histories:

$$\hat\Sigma_{\mathcal T,\mathcal H} = \frac{1}{k}\sum_{t=1}^{k} \phi^{\mathcal T}_t \phi^{\mathcal H\top}_t \qquad \hat\Sigma_{\mathcal H,\mathcal H} = \frac{1}{k}\sum_{t=1}^{k} \phi^{\mathcal H}_t \phi^{\mathcal H\top}_t \qquad (9.6)$$

Let $L_{\mathcal H}$ be the lower triangular Cholesky factor of $\hat\Sigma_{\mathcal H,\mathcal H}$. Then we can find a predictive compression
of histories by a singular value decomposition (SVD) of the weighted covariance: write

$$U D V^\top \approx \hat\Sigma_{\mathcal T,\mathcal H} L_{\mathcal H}^{-\top} \qquad (9.7)$$

for a truncated SVD [36] of the weighted covariance, where $U$ are the left singular vectors, $V^\top$
are the right singular vectors, and $D$ is the diagonal matrix of singular values. The number of
columns of $U$, $V$, or $D$ is equal to the number of retained singular values.3 Then we define

$$\hat U = U D^{1/2} \qquad (9.8)$$

to be the mapping from the low-dimensional compressed space up to the high-dimensional space
of features of the future.

Given $\hat U$, we would like to find a compression operator $\hat V$ that optimally predicts features
of the future through the bottleneck defined by $\hat U$. The least squares estimate can be found by
minimizing the loss

$$\mathcal L(V) = \left\| \phi^{\mathcal T}_{1:k} - \hat U V \phi^{\mathcal H}_{1:k} \right\|_F^2 \qquad (9.9)$$

where $\|\cdot\|_F$ denotes the Frobenius norm. We can find the minimum by taking the derivative of
this loss with respect to $V$, setting it to zero, and solving for $V$ (see Appendix, Section 12.1.1 for
details), giving us:

$$\hat V = \arg\min_V \mathcal L(V) = \hat U^\top \hat\Sigma_{\mathcal T,\mathcal H} (\hat\Sigma_{\mathcal H,\mathcal H})^{-1} \qquad (9.10)$$
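
The steps in Equations 9.6 through 9.10 can be sketched as follows. This is only an illustration under stated assumptions: features are stored column-wise, the empirical history covariance is positive definite (so the Cholesky factor exists), and the function name `predictive_compression` is hypothetical.

```python
import numpy as np

def predictive_compression(phi_future, phi_hist, rank):
    """Sketch of Equations 9.6-9.10; returns (U_hat, V_hat).

    phi_future : (m, k) features of the future, one column per time step
    phi_hist   : (d, k) features of history, one column per time step
    rank       : number of singular values to retain
    """
    k = phi_hist.shape[1]
    cov_th = phi_future @ phi_hist.T / k                     # Eq. 9.6
    cov_hh = phi_hist @ phi_hist.T / k                       # Eq. 9.6
    L = np.linalg.cholesky(cov_hh)                           # lower triangular factor
    weighted = np.linalg.solve(L, cov_th.T).T                # cov_th @ inv(L).T, Eq. 9.7
    U, s, _ = np.linalg.svd(weighted, full_matrices=False)   # truncated below
    U_hat = U[:, :rank] * np.sqrt(s[:rank])                  # Eq. 9.8
    # Eq. 9.10: V_hat = U_hat^T cov_th cov_hh^{-1} (cov_hh is symmetric)
    V_hat = np.linalg.solve(cov_hh, (U_hat.T @ cov_th).T).T
    return U_hat, V_hat
```

The compressed features at time $t$ would then be `V_hat @ phi_hist[:, t]`. Solving linear systems instead of forming explicit inverses is a numerical design choice of this sketch, not something the text prescribes.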

By weighting different features of the future differently, we can change the approximate
compression in interesting ways. For example, as we will see in Section 9.3.4, scaling up future
reward by a constant factor results in a value-directed compression; but, unlike previous ways
to find value-directed compressions [78], we do not need to know a model of our system ahead
of time. For another example, define $L_{\mathcal T}$ to be the lower triangular Cholesky factor of the empirical
covariance of future features $\hat\Sigma_{\mathcal T,\mathcal T}$. Then, if we scale features of the future by $L_{\mathcal T}^{-\top}$, the
singular value decomposition will preserve the largest possible amount of mutual information
between features of the future and features of history. This is equivalent to canonical correlation
analysis [41, 95], and the matrix $D$ of singular values becomes a diagonal matrix of canonical correlations between
futures and histories.

3 If our empirical estimate $\hat\Sigma_{\mathcal T,\mathcal H}$ were exact, we could keep all nonzero singular values to find the smallest possible
compression that does not lose any predictive power. In practice, though, there will be noise in our estimate, and
$\hat\Sigma_{\mathcal T,\mathcal H} L_{\mathcal H}^{-\top}$ will be full rank. If we know the true rank $n$ of $\Sigma_{\mathcal T,\mathcal H}$, we can choose the first $n$ singular values to define
a subspace for compression. Or, we can choose a smaller subspace that results in an approximate compression: by
selectively dropping columns of $U$ corresponding to small singular values, we can trade off compression against
predictive power. Directions of larger variance in features of the future correspond to larger singular values in the
SVD, so we minimize prediction error by truncating the smallest singular values. By contrast with an SVD of the
unscaled covariance, we do not attempt to minimize reconstruction error for features of history, since features of
history are standardized when we multiply by the inverse Cholesky factor.

9.3.2  Predictive  State  Temporal  Difference  Learning 

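Now that we have found a predictive compression operator $\hat V$ via Equation 9.10, we can replace
the features of history $\phi^{\mathcal H}_t$ with the compressed features $\hat V \phi^{\mathcal H}_t$ in the Bellman recursion,
Equation 9.4. Doing so results in the following approximate Bellman equation:

$$w^\top \hat V \phi^{\mathcal H}_{1:k} \approx \mathcal R_{1:k} + \gamma\, w^\top \hat V \phi^{\mathcal H}_{2:k+1} \qquad (9.11)$$

The least squares solution for $w$ is still prone to an error-in-variables problem. The variable $\phi^{\mathcal H}$
is still correlated with the true independent variables and uncorrelated with noise, and so we
can again use it as an instrumental variable to unbias the estimate of $w$. Define the additional
empirical covariance matrices:

$$\hat\Sigma_{\mathcal R,\mathcal H} = \frac{1}{k}\sum_{t=1}^{k} \mathcal R_t\, \phi^{\mathcal H\top}_t \qquad \hat\Sigma_{\mathcal H^+,\mathcal H} = \frac{1}{k}\sum_{t=1}^{k} \phi^{\mathcal H}_{t+1} \phi^{\mathcal H\top}_t \qquad (9.12)$$

Then, the corrected Bellman equation is:

$$\hat w^\top \hat V \hat\Sigma_{\mathcal H,\mathcal H} = \hat\Sigma_{\mathcal R,\mathcal H} + \gamma\, \hat w^\top \hat V \hat\Sigma_{\mathcal H^+,\mathcal H} \qquad (9.13)$$

and solving for $\hat w$ gives us the Predictive State Temporal Difference (PSTD) learning algorithm:

$$\hat w^\top = \hat\Sigma_{\mathcal R,\mathcal H} \left(\hat V \hat\Sigma_{\mathcal H,\mathcal H} - \gamma\, \hat V \hat\Sigma_{\mathcal H^+,\mathcal H}\right)^{\dagger} \qquad (9.14)$$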

So  far  we  have  provided  some  intuition  for  why  predictive  features  should  be  better  than  arbitrary 
features  for  temporal  difference  learning.  Below  we  will  show  an  additional  benefit:  the  model- 
free  algorithm  in  Equation  9.14  is,  under  some  circumstances,  equivalent  to  a  model-based  value 
function approximation method which uses subspace identification to learn Predictive State Representations
[14, 84].
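
A minimal sketch of the PSTD estimate in Equation 9.14 is given below, assuming a compression operator has already been computed via Equation 9.10 and that features are stored column-wise. The function name `pstd_weights` and the argument layout are illustrative assumptions.

```python
import numpy as np

def pstd_weights(V_hat, phi, rewards, gamma):
    """Sketch of the PSTD estimate of w (Equation 9.14).

    V_hat   : (n, d) compression operator (Equation 9.10)
    phi     : (d, k+1) history features, one column per time step
    rewards : (k,) rewards aligned with the first k columns of phi
    gamma   : discount factor
    """
    k = rewards.shape[0]
    phi_t, phi_next = phi[:, :k], phi[:, 1:k + 1]
    cov_rh = rewards @ phi_t.T / k        # Eq. 9.12
    cov_hh = phi_t @ phi_t.T / k          # Eq. 9.6
    cov_h1h = phi_next @ phi_t.T / k      # Eq. 9.12
    A = V_hat @ (cov_hh - gamma * cov_h1h)    # (n, d)
    return cov_rh @ np.linalg.pinv(A)         # w, shape (n,)
```

The value estimate for the history at time $t$ is then `w @ V_hat @ phi[:, t]`. When `V_hat` is the identity and the bracketed matrix is invertible, the expression reduces to the LSTD estimate of Equation 9.5.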

9.3.3  PSRs 

We are now almost ready to show that the model-free PSTD learning algorithm introduced in
Section  9.3.2  is  equivalent  to  a  model-based  algorithm  built  around  PSR  learning.  In  order  to  do 
so,  we  will  use  a  number  of  equations  from  Chapter  4,  specifically  the  definitions  of  moments 
given  in  Equations  4.1  and  PSR  parameter  estimates  given  in  Equations  4.2.  In  addition  we 
define  a  few  more  moments  and  parameters: 

First we define $\hat\Sigma_{\mathcal H^+,ao,\mathcal H}$, a set of matrices, one for each action-observation pair, that represent
the covariance between features of history before and after taking action $a$ and observing $o$. In
the following, we assume that histories $h_t \sim \omega$.

$$\hat\Sigma_{\mathcal H^+,ao,\mathcal H} = \frac{1}{k}\sum_{t=1}^{k} \phi^{\mathcal H}_{t+1}\, \mathbb I(a_t = a,\, o_t = o)\, \phi^{\mathcal H\top}_t \qquad (9.15a)$$

Since the dimensions of each $\hat\Sigma_{\mathcal H^+,ao,\mathcal H}$ are fixed, as $k \to \infty$ these empirical covariances converge
to the true covariances $\Sigma_{\mathcal H^+,ao,\mathcal H}$ with probability 1. Next we define $\Sigma_{\mathcal R,\mathcal H} \stackrel{\text{def}}{=} \mathbb E[\mathcal R_t\, \phi^{\mathcal H\top}_t \mid h_t \sim \omega]$,
and approximate the covariance (in this case a vector) of reward and features of history:

$$\hat\Sigma_{\mathcal R,\mathcal H} = \frac{1}{k}\sum_{t=1}^{k} \mathcal R_t\, \phi^{\mathcal H\top}_t \qquad (9.15b)$$

Again, as $k \to \infty$, $\hat\Sigma_{\mathcal R,\mathcal H}$ converges to $\Sigma_{\mathcal R,\mathcal H}$ with probability 1.

We now wish to use the above-defined matrices to learn a PSR from data. To do so we need to
make a somewhat-restrictive assumption: we assume that our features of history are rich enough
to determine the state of the system, i.e., the regression from $\phi^{\mathcal H}$ to $s$ is exact: $s_t = \Sigma_{\mathcal S,\mathcal H} \Sigma_{\mathcal H,\mathcal H}^{-1} \phi^{\mathcal H}_t$.
We discuss how to relax this assumption below in Section 9.3.5. We also need a matrix $U$ such
that $U^\top \Phi^{\mathcal T} R$ is invertible; with probability 1 a random matrix satisfies this condition, but as we
will see below, it is useful to choose $U$ via SVD of a scaled version of $\Sigma_{\mathcal T,\mathcal H}$ as described in
Sec. 9.3.1. Using our assumptions we can show a useful identity for $\Sigma_{\mathcal H^+,ao,\mathcal H}$:

$$\Sigma_{\mathcal S,\mathcal H} \Sigma_{\mathcal H,\mathcal H}^{-1} \Sigma_{\mathcal H^+,ao,\mathcal H} = M_{ao} \Sigma_{\mathcal S,\mathcal H} \qquad (9.16)$$

This identity is at the heart of our learning algorithm: it shows that $\Sigma_{\mathcal H^+,ao,\mathcal H}$ contains a hidden
copy of $M_{ao}$, the main PSR parameter that we need to learn. We would like to recover $M_{ao}$ via
Eq. 9.16, $M_{ao} = \Sigma_{\mathcal S,\mathcal H} \Sigma_{\mathcal H,\mathcal H}^{-1} \Sigma_{\mathcal H^+,ao,\mathcal H} \Sigma_{\mathcal S,\mathcal H}^{\dagger}$, but of course we do not know $\Sigma_{\mathcal S,\mathcal H}$. Fortunately,
though, it turns out that we can use $U^\top \Sigma_{\mathcal T,\mathcal H}$ as a stand-in, since this matrix differs from $\Sigma_{\mathcal S,\mathcal H}$
only by an invertible transform.


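$$\begin{aligned}
b_t &\stackrel{\text{def}}{=} U^\top \Sigma_{\mathcal T,\mathcal H} (\Sigma_{\mathcal H,\mathcal H})^{-1} \phi^{\mathcal H}_t \\
&= U^\top \Phi^{\mathcal T} R\, \Sigma_{\mathcal S,\mathcal H} (\Sigma_{\mathcal H,\mathcal H})^{-1} \phi^{\mathcal H}_t \\
&= (U^\top \Phi^{\mathcal T} R)\, s_t \qquad (9.17a)
\end{aligned}$$

$$\begin{aligned}
B_{ao} &\stackrel{\text{def}}{=} U^\top \Sigma_{\mathcal T,\mathcal H} (\Sigma_{\mathcal H,\mathcal H})^{-1} \Sigma_{\mathcal H^+,ao,\mathcal H} (U^\top \Sigma_{\mathcal T,\mathcal H})^{\dagger} \\
&= U^\top \Phi^{\mathcal T} R\, \Sigma_{\mathcal S,\mathcal H} (\Sigma_{\mathcal H,\mathcal H})^{-1} \Sigma_{\mathcal H^+,ao,\mathcal H} (U^\top \Sigma_{\mathcal T,\mathcal H})^{\dagger} \\
&= (U^\top \Phi^{\mathcal T} R)\, M_{ao}\, \Sigma_{\mathcal S,\mathcal H} (U^\top \Sigma_{\mathcal T,\mathcal H})^{\dagger} \\
&= (U^\top \Phi^{\mathcal T} R)\, M_{ao}\, (U^\top \Phi^{\mathcal T} R)^{-1} (U^\top \Phi^{\mathcal T} R)\, \Sigma_{\mathcal S,\mathcal H} (U^\top \Sigma_{\mathcal T,\mathcal H})^{\dagger} \\
&= (U^\top \Phi^{\mathcal T} R)\, M_{ao}\, (U^\top \Phi^{\mathcal T} R)^{-1} \qquad (9.17b)
\end{aligned}$$

$$\begin{aligned}
b_\eta^\top &\stackrel{\text{def}}{=} \Sigma_{\mathcal R,\mathcal H} (U^\top \Sigma_{\mathcal T,\mathcal H})^{\dagger} \\
&= \eta^\top \Sigma_{\mathcal S,\mathcal H} (U^\top \Sigma_{\mathcal T,\mathcal H})^{\dagger} \\
&= \eta^\top (U^\top \Phi^{\mathcal T} R)^{-1} (U^\top \Phi^{\mathcal T} R)\, \Sigma_{\mathcal S,\mathcal H} (U^\top \Sigma_{\mathcal T,\mathcal H})^{\dagger} \\
&= \eta^\top (U^\top \Phi^{\mathcal T} R)^{-1} \qquad (9.17c)
\end{aligned}$$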
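
The similarity-transform claims in Equations 9.17b and 9.17c can be checked numerically on synthetic population quantities. In the sketch below every matrix is a randomly generated stand-in (an assumption for illustration only), constructed so that the factorization of the future/history covariance and the identity in Eq. 9.16 hold exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 3, 6, 5                          # state, history-feature, future-feature dims

M = rng.standard_normal((n, n))            # stand-in for M_ao
cov_sh = rng.standard_normal((n, d))       # stand-in for Sigma_{S,H} (full row rank)
G = rng.standard_normal((d, d))
cov_hh = G @ G.T + np.eye(d)               # symmetric positive definite Sigma_{H,H}
phi_R = rng.standard_normal((m, n))        # stand-in for Phi^T R
eta = rng.standard_normal(n)               # reward vector eta
U = rng.standard_normal((m, n))            # random U; U^T Phi^T R is invertible w.p. 1

cov_th = phi_R @ cov_sh                    # Sigma_{T,H} = Phi^T R Sigma_{S,H}
A = cov_sh @ np.linalg.inv(cov_hh)
cov_h1 = np.linalg.pinv(A) @ M @ cov_sh    # chosen so that Eq. 9.16 holds
cov_rh = eta @ cov_sh                      # Sigma_{R,H} = eta^T Sigma_{S,H}

P = U.T @ phi_R                            # the invertible transform U^T Phi^T R
UT = U.T @ cov_th
B = UT @ np.linalg.solve(cov_hh, cov_h1) @ np.linalg.pinv(UT)   # Eq. 9.17b, first line
b_eta = cov_rh @ np.linalg.pinv(UT)                             # Eq. 9.17c, first line

print(np.allclose(B, P @ M @ np.linalg.inv(P)))
print(np.allclose(b_eta, eta @ np.linalg.inv(P)))
```

Both comparisons should agree up to floating-point error, reflecting that the learned parameters differ from the true PSR parameters only by the invertible transform.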


9.3.4  Predictive  State  Temporal  Difference  Learning  Revisited 

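With the additional parameter definitions above it is not difficult to show that model-free PSTD
learning is actually leveraging ideas from system identification to get good value function estimates.
For a fixed policy $\pi$, a PSR’s value function is a linear function of state, $J^\pi(s) = w^\top b$,
and is the solution of the PSR Bellman equation [45]: for all $b$, $w^\top b = b_\eta^\top b + \gamma \sum_{o \in O} w^\top B_{\pi o}\, b$,
or equivalently,

$$w^\top = b_\eta^\top + \gamma \sum_{o \in O} w^\top B_{\pi o} \qquad (9.18)$$

If we substitute in our learned PSR parameters from Equations 9.17(a-c), we get

$$\begin{aligned}
\hat w^\top &= \hat\Sigma_{\mathcal R,\mathcal H} (U^\top \hat\Sigma_{\mathcal T,\mathcal H})^{\dagger} + \gamma \sum_{o \in O} \hat w^\top U^\top \hat\Sigma_{\mathcal T,\mathcal H} (\hat\Sigma_{\mathcal H,\mathcal H})^{-1} \hat\Sigma_{\mathcal H^+,\pi o,\mathcal H} (U^\top \hat\Sigma_{\mathcal T,\mathcal H})^{\dagger} \\
\hat w^\top U^\top \hat\Sigma_{\mathcal T,\mathcal H} &= \hat\Sigma_{\mathcal R,\mathcal H} + \gamma\, \hat w^\top U^\top \hat\Sigma_{\mathcal T,\mathcal H} (\hat\Sigma_{\mathcal H,\mathcal H})^{-1} \hat\Sigma_{\mathcal H^+,\mathcal H}
\end{aligned}$$

since, by comparing Eqs. 9.15a and 9.12, we can see that $\sum_{o \in O} \hat\Sigma_{\mathcal H^+,\pi o,\mathcal H} = \hat\Sigma_{\mathcal H^+,\mathcal H}$. Now,
suppose that we define $\hat U$ and $\hat V$ by Eqs. 9.8 and 9.10, and let $U = \hat U$ as suggested above in
Sec. 4.2. Then $U^\top \hat\Sigma_{\mathcal T,\mathcal H} = \hat V \hat\Sigma_{\mathcal H,\mathcal H}$, and

$$\begin{aligned}
\hat w^\top \hat V \hat\Sigma_{\mathcal H,\mathcal H} &= \hat\Sigma_{\mathcal R,\mathcal H} + \gamma\, \hat w^\top \hat V \hat\Sigma_{\mathcal H^+,\mathcal H} \\
\hat w^\top &= \hat\Sigma_{\mathcal R,\mathcal H} \left(\hat V \hat\Sigma_{\mathcal H,\mathcal H} - \gamma\, \hat V \hat\Sigma_{\mathcal H^+,\mathcal H}\right)^{\dagger} \qquad (9.19)
\end{aligned}$$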

Eq.  9.19  is  exactly  the  PSTD  algorithm  (Eq.  9.14).  So,  we  have  shown  that,  if  we  learn  a  PSR 
by  the  subspace  identification  algorithm  of  Sec.  4.2  and  then  compute  its  value  function  via  the 
Bellman  equation,  we  get  the  exact  same  answer  as  if  we  had  directly  learned  the  value  function 
via  the  model-free  PSTD  method.  In  addition  to  adding  to  our  understanding  of  both  methods, 
an  important  corollary  of  this  result  is  that  PSTD  is  a  statistically  consistent  algorithm  for  PSR 
value  function  approximation — to  our  knowledge,  the  first  such  result  for  a  TD  method. 

PSTD  learning  is  related  to  value-directed  compression  of  POMDPs  [78].  If  we  learn  a  PSR 
from  data  generated  by  a  POMDP,  then  the  PSR  state  is  exactly  a  linear  compression  of  the 
POMDP  state  [84,  93].  The  compression  can  be  exact  or  approximate,  depending  on  whether 
we  include  enough  features  of  the  future  and  whether  we  keep  all  or  only  some  nonzero  singular 
values  in  our  bottleneck.  If  we  include  only  reward  as  a  feature  of  the  future,  we  get  a  value- 
directed compression in the sense of Poupart and Boutilier [78]. If desired, we can tune the degree
of value-directedness of our compression by scaling the relative variance of our features: the
higher  the  variance  of  the  reward  feature  compared  to  other  features,  the  more  value-directed  the 
resulting  compression  will  be.  Our  work  significantly  diverges  from  previous  work  on  POMDP 
compression  in  one  important  respect:  prior  work  assumes  access  to  the  true  POMDP  model, 
while  we  make  no  such  assumption,  and  learn  a  compressed  representation  directly  from  data. 

9.3.5  Insights  from  Subspace  Identification 

The close connection to subspace identification for PSRs provides additional insight into the temporal
difference learning procedure. In Equation 9.17 we made the assumption that the features
of history are rich enough to completely determine the state of the dynamical system. In fact,
using theory developed in Chapter 4 and [14], it is possible to relax this assumption and instead
assume that state is merely correlated with features of history. In this case, the value function
parameter $w$ can be estimated as

$$\hat w^\top = \hat\Sigma_{\mathcal R,\mathcal H} (U^\top \hat\Sigma_{\mathcal T,\mathcal H})^{\dagger} \Big(I - \sum_{o \in O} U^\top \hat\Sigma_{\mathcal T^+,ao,\mathcal H} (U^\top \hat\Sigma_{\mathcal T,\mathcal H})^{\dagger}\Big)^{\dagger} = \hat\Sigma_{\mathcal R,\mathcal H} \Big(U^\top \hat\Sigma_{\mathcal T,\mathcal H} - \sum_{o \in O} U^\top \hat\Sigma_{\mathcal T^+,ao,\mathcal H}\Big)^{\dagger}.$$

Since we no longer assume that state is completely specified by features of history, we can no longer
apply the learned value function to $U^\top \hat\Sigma_{\mathcal T,\mathcal H} (\hat\Sigma_{\mathcal H,\mathcal H})^{-1} \phi^{\mathcal H}_t$
at each time $t$. Instead we need to learn a full PSR model and filter with the model to estimate
state.

9.4 Experimental Results


We  designed  several  experiments  to  evaluate  the  properties  of  the  PSTD  learning  algorithm.  In 
the  first  set  of  experiments  we  look  at  the  comparative  merits  of  PSTD  with  respect  to  LSTD 
and  LARS-TD  when  applied  to  the  problem  of  estimating  the  value  function  of  a  reduced-rank 
POMDP.  In  the  second  set  of  experiments,  we  apply  PSTD  to  a  benchmark  optimal  stopping 
problem  (pricing  a  fictitious  financial  derivative),  and  show  that  PSTD  outperforms  competing 
approaches. 


9.4.1  Estimating  the  Value  Function  of  a  RR-POMDP 

We evaluate the PSTD learning algorithm on a synthetic example derived from [90]. The problem
is to find the value function of a policy in a partially observable Markov decision process
(POMDP). The POMDP has 4 latent states, but the policy’s transition matrix is low rank: the
resulting belief distributions can be represented in a 3-dimensional subspace of the original belief
simplex. A reward of 1 is given in the first and third latent state and a reward of 0 in the
other two latent states (see Appendix, Section 12.1.2). The system emits 2 possible observations,
conflating information about the latent states.

We perform 3 experiments, comparing the performance of LSTD, LARS-TD, PSTD, and
PSTD as formulated in Section 9.3.5 (which we call PSTD2) when different sets of features are
used. In each case we compare the value function estimated by each algorithm to the true value
function computed by $J^\pi = \mathcal R (I - \gamma T^\pi)^{-1}$.
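
For reference, the closed-form true value function can be evaluated with a single linear solve. The sketch below assumes the reward is a row vector over latent states and that the policy's transition matrix is column-stochastic, i.e. `T_pi[i, j]` is the probability of moving to state `i` from state `j`; this convention and the uniform transition matrix in the usage lines are assumptions for illustration, not the matrices from the appendix.

```python
import numpy as np

def true_value(reward, T_pi, gamma):
    """Return the row vector J solving J = reward + gamma * J @ T_pi."""
    n = reward.shape[0]
    # J (I - gamma T) = R   is equivalent to   (I - gamma T)^T J^T = R^T
    return np.linalg.solve((np.eye(n) - gamma * T_pi).T, reward)

# Reward of 1 in the first and third latent states, 0 in the other two.
R = np.array([1.0, 0.0, 1.0, 0.0])
T = np.full((4, 4), 0.25)        # hypothetical uniform transitions
J = true_value(R, T, 0.9)        # [5.5, 4.5, 5.5, 4.5] for this toy chain
```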

In the first experiment we execute the policy $\pi$ for 1000 time steps. We split the data into
overlapping histories and tests of length 5, and sample 10 of these histories and tests to serve
as centers for Gaussian radial basis functions. We then evaluate each basis function at every
remaining sample. Then, using these features, we learned the value function using LSTD, LARS-TD,
PSTD with linear dimension 3, and PSTD2 with linear dimension 3 (Figure 9.1(A)).4 In this
experiment, PSTD and PSTD2 both had lower mean squared error than the other approaches. For
the second experiment, we added 490 random features to the 10 good features and then attempted
to learn the value function with each algorithm (Figure 9.1(B)). In this case, LSTD
and PSTD both had difficulty fitting the value function due to the large number of irrelevant
features in both tests and histories and the relatively small amount of training data. LARS-TD,
designed for precisely this scenario, was able to select the 10 relevant features and estimate
the value function better by a substantial margin. Surprisingly, in this experiment PSTD2 not
only outperformed PSTD but bested even LARS-TD. For the third experiment, we increased the
number of sampled features from 10 to 500. In this case, each feature was somewhat relevant,
but the number of features was relatively large compared to the amount of training data. This
situation occurs frequently in practice: it is often easy to find a large number of features that are
at least somewhat related to state. PSTD and PSTD2 both outperform LARS-TD, and each of
these subspace and subset selection methods outperforms LSTD by a large margin by efficiently
estimating the value function (Figure 9.1(C)).
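
The feature construction used in the first experiment can be sketched roughly as follows for one window type (histories or tests). The bandwidth, the random seed, and the function name `rbf_features` are assumptions, since the text does not specify them.

```python
import numpy as np

def rbf_features(obs, length=5, n_centers=10, bandwidth=1.0, seed=0):
    """Gaussian RBF features of overlapping windows of an observation sequence.

    obs : 1-D array of observations
    Returns an (n_centers, n_remaining) array: each basis function evaluated
    at every window that was not chosen as a center.
    """
    rng = np.random.default_rng(seed)
    windows = np.lib.stride_tricks.sliding_window_view(obs, length).astype(float)
    idx = rng.choice(len(windows), size=n_centers, replace=False)
    centers = windows[idx]
    rest = np.delete(windows, idx, axis=0)          # the remaining samples
    sq_dist = ((rest[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dist / (2.0 * bandwidth ** 2)).T
```

For a 1000-step binary observation sequence this yields 996 overlapping windows, 10 of which serve as centers, leaving a feature matrix of shape (10, 986).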

4 Comparing LSTD and PSTD is straightforward; the two methods differ only by the compression operator $\hat V$.

Figure 9.1: Experimental Results. Error bars indicate standard error. (A) Estimating the value
function with a small number of informative features. PSTD and PSTD2 both do well. (B)
Estimating the value function with a small set of informative features and a large set of random
features. LARS-TD is designed for this scenario and dramatically outperforms PSTD and LSTD;
however, it does not outperform PSTD2. (C) Estimating the value function with a large set of
semi-informative features. PSTD is able to determine a small set of compressed features that
retain the maximal amount of information about the value function, outperforming LSTD by a
very large margin. (D) Pricing a high-dimensional derivative via policy iteration. The y-axis
is expected reward for the current policy at each iteration. The optimal threshold strategy (sell
if price is above a threshold [115]) is in black, LSTD (16 canonical features) is in blue, LSTD
(on the full 220 features) is cyan, LARS-TD (feature selection from set of 220) is in green, and
PSTD (16 dimensions, compressing 220 features (16 + 204)) is in red.


9.4.2  Pricing  A  High-dimensional  Financial  Derivative 

Derivatives  are  financial  contracts  with  payoffs  linked  to  the  future  prices  of  basic  assets  such 
as  stocks,  bonds  and  commodities.  In  some  derivatives  the  contract  holder  has  no  choices,  but 
in  more  complex  cases,  the  contract  owner  must  make  decisions — e.g.,  with  early  exercise  the 
contract  holder  can  decide  to  terminate  the  contract  at  any  time  and  receive  payments  based  on 
prevailing  market  conditions.  In  these  cases,  the  value  of  the  derivative  depends  on  how  the 
contract  holder  acts.  Deciding  when  to  exercise  is  therefore  an  optimal  stopping  problem:  at 
each  point  in  time,  the  contract  holder  must  decide  whether  to  continue  holding  the  contract 
or  exercise.  Such  stopping  problems  provide  an  ideal  testbed  for  policy  evaluation  methods, 
since  we  can  easily  collect  a  single  data  set  which  is  sufficient  to  evaluate  any  policy:  we  just 
choose  the  “continue”  action  forever.  (We  can  then  evaluate  the  “stop”  action  easily  in  any  of 
the  resulting  states,  since  the  immediate  reward  is  given  by  the  rules  of  the  contract,  and  the  next 
state  is  the  terminal  state  by  definition.) 

We consider the financial derivative introduced by Tsitsiklis and Van Roy [115]. The derivative
generates payoffs that are contingent on the prices of a single stock. At the end of a given
day,  the  holder  may  opt  to  exercise.  At  exercise  the  owner  receives  a  payoff  equal  to  the  current 
price  of  the  stock  divided  by  the  price  100  days  beforehand.  We  can  think  of  this  derivative  as 
a  “psychic  call”:  the  owner  gets  to  decide  whether  s/he  would  like  to  have  bought  an  ordinary 
100-day  European  call  option,  at  the  then-current  market  price,  100  days  ago. 

In our simulation (and unknown to the investor), the underlying stock price follows a geometric
Brownian motion with volatility $\sigma = 0.02$ and continuously compounded short term growth
rate $\rho = 0.0004$. Assuming stock prices fluctuate only on days when the market is open, these parameters
correspond to an annual growth rate of $\sim 10\%$. In more detail, if $w_t$ is a standard Brownian
motion, then the stock price $p_t$ evolves as $dp_t = \rho\, p_t\, dt + \sigma\, p_t\, dw_t$, and we can summarize
relevant state at the end of each day as a vector $x_t \in \mathbb R^{100}$, with

$$x_t = \left(\frac{p_{t-99}}{p_{t-100}},\ \frac{p_{t-98}}{p_{t-100}},\ \ldots,\ \frac{p_t}{p_{t-100}}\right)^{\top}.$$

The $i$th dimension $x_t(i)$ represents the amount a \$1 investment in a stock at time $t - 100$ would
grow to at time $t - 100 + i$. This process is Markov and ergodic [23, 115]: $x_t$ and $x_{t+100}$
are independent and identically distributed. The immediate reward for exercising the option is
$G(x) = x(100)$, and the immediate reward for continuing to hold the option is 0. The discount
factor $\gamma = e^{-\rho}$ is determined by the growth rate; this corresponds to assuming that the risk-free
interest rate is equal to the stock’s growth rate, meaning that the investor gains nothing in
expectation by holding the stock itself.
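
A rough simulation of this setup is sketched below. It assumes a time step of one trading day and uses the exact log-normal discretization of geometric Brownian motion; the function names, the initial price, and the figure of roughly 252 trading days per year used in the final comment are assumptions that do not come from the text.

```python
import numpy as np

sigma, rho = 0.02, 0.0004          # daily volatility and growth rate from the text
gamma = np.exp(-rho)               # discount factor

def simulate_prices(n_days, p0=1.0, seed=0):
    """Daily prices of a geometric Brownian motion (exact discretization, dt = 1)."""
    rng = np.random.default_rng(seed)
    log_returns = (rho - 0.5 * sigma ** 2) + sigma * rng.standard_normal(n_days)
    return p0 * np.exp(np.concatenate([[0.0], np.cumsum(log_returns)]))

def state(prices, t):
    """x_t in R^100: the last 100 daily prices divided by the price 100 days ago."""
    return prices[t - 99:t + 1] / prices[t - 100]

prices = simulate_prices(1000)
x = state(prices, 500)
exercise_payoff = x[-1]            # G(x) = x(100) = p_t / p_{t-100}

# With about 252 trading days per year, exp(252 * rho) - 1 is roughly 0.106,
# in line with the annual growth rate of about 10% quoted above.
annual_growth = np.exp(252 * rho) - 1
```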

The value of the derivative, if the current state is $x$, is given by $V^*(x) = \sup_t \mathbb E[\gamma^t G(x_t) \mid x_0 = x]$.
Our goal is to calculate an approximate value function $V(x) = w^\top \phi^{\mathcal H}(x)$, and then use
this value function to generate a stopping time $\min\{t \mid G(x_t) \ge V(x_t)\}$. To do so, we sample a
sequence of 1,000,000 states $x_t \in \mathbb R^{100}$ and calculate features $\phi^{\mathcal H}$ of each state. We then perform
policy iteration on this sample, alternately estimating the value function under a given policy and
then using this value function to define a new greedy policy “stop if $G(x_t) \ge w^\top \phi^{\mathcal H}(x_t)$.”


Within  the  above  strategy,  we  have  two  main  choices:  which  features  do  we  use,  and  how  do 
we  estimate  the  value  function  in  terms  of  these  features.  For  value  function  estimation,  we  used 
LSTD, LARS-TD, or PSTD. In each case we re-used our 1,000,000-state sample trajectory for
all  iterations:  we  start  at  the  beginning  and  follow  the  trajectory  as  long  as  the  policy  chooses 
the  “continue”  action,  with  reward  0  at  each  step.  When  the  policy  executes  the  “stop”  action, 
the  reward  is  G(x)  and  the  next  state’s  features  are  all  0;  we  then  restart  the  policy  100  steps  in 
the  future,  after  the  process  has  fully  mixed.  For  feature  selection,  we  are  fortunate:  previous 
researchers  have  hand-selected  a  “good”  set  of  16  features  for  this  data  set  through  repeated  trial 
and  error  (see  Appendix,  Section  12.1.2  and  [23,  115]).  We  greatly  expand  this  set  of  features, 
then  use  PSTD  to  synthesize  a  small  set  of  high-quality  combined  features.  Specifically,  we  add 
the  entire  100-step  state  vector,  the  squares  of  the  components  of  the  state  vector,  and  several 
additional  nonlinear  features,  increasing  the  total  number  of  features  from  16  to  220.  We  use 
histories  of  length  1,  tests  of  length  5,  and  (for  comparison’s  sake)  we  choose  a  linear  dimension 
of  16.  Tests  (but  not  histories)  were  value-directed  by  reducing  the  variance  of  all  features  except 
reward  by  a  factor  of  100. 
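
The way a single "continue" trajectory is re-used under a given policy can be sketched as below. The function name `rollout_episodes` and the returned tuple layout are hypothetical; the stopping rule and the 100-step restart follow the description above.

```python
import numpy as np

def rollout_episodes(features, payoff, w, mix=100):
    """Walk one long trajectory under the greedy stopping policy.

    features : (d, N) history features of each sampled state
    payoff   : (N,) exercise payoff G(x_t) of each sampled state
    w        : (d,) current value-function weights
    Returns a list of (t, reward, next_is_terminal) transitions.
    """
    transitions, t, N = [], 0, payoff.shape[0]
    while t < N - 1:
        if payoff[t] >= w @ features[:, t]:      # greedy rule: stop and collect G(x_t)
            transitions.append((t, payoff[t], True))
            t += mix                             # restart after the process has mixed
        else:                                    # continue holding, reward 0
            transitions.append((t, 0.0, False))
            t += 1
    return transitions
```

The resulting transitions are what a policy-evaluation step (LSTD, LARS-TD, or PSTD) would consume, with next-state features set to zero whenever `next_is_terminal` is true.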

Figure 9.1(D) shows results. We compared PSTD (reducing 220 to 16 features) to LSTD with
either the 16 hand-selected features or the full 220 features, as well as to LARS-TD (220 features)
and to a simple thresholding strategy [115]. In each case we evaluated the final policy on
10,000 new random trajectories. PSTD outperformed each of its competitors, improving on the
next best approach, LARS-TD, by 1.75 percentage points. In fact, PSTD performs better than the
best previously reported approach [23, 115] by 1.24 percentage points. These improvements correspond
to appreciable fractions of the risk-free interest rate (which is about 4 percentage points
over the 100 day window of the contract), and therefore to significant arbitrage opportunities: an
investor who doesn’t know the best strategy will consistently undervalue the security, allowing
an informed investor to buy it for below its expected value.


9.5  Conclusion 

In this chapter, we attack the feature selection problem for temporal difference learning. Although
well-known temporal difference algorithms such as LSTD can provide asymptotically
unbiased  estimates  of  value  function  parameters  in  linear  architectures,  they  can  have  trouble 
in  finite  samples:  if  the  number  of  features  is  large  relative  to  the  number  of  training  samples, 
then  they  can  have  high  variance  in  their  value  function  estimates.  For  this  reason,  in  real-world 
problems,  a  substantial  amount  of  time  is  spent  selecting  a  small  set  of  features,  often  by  trial 
and  error  [23,  115]. 

To remedy this problem, we present the PSTD algorithm, a new approach to feature selection
for TD methods, which demonstrates how insights from system identification can benefit
reinforcement  learning.  PSTD  automatically  chooses  a  small  set  of  features  that  are  relevant  for 
prediction  and  value  function  approximation.  It  approaches  feature  selection  from  a  bottleneck 
perspective,  by  finding  a  small  set  of  features  that  preserves  only  predictive  information.  Because 
of  the  focus  on  predictive  information,  the  PSTD  approach  is  closely  connected  to  PSRs:  under 
appropriate  assumptions,  PSTD’s  compressed  set  of  features  is  asymptotically  equivalent  to  PSR 
state,  and  PSTD  is  a  consistent  estimator  of  the  PSR  value  function. 

We  demonstrate  the  merits  of  PSTD  compared  to  two  popular  alternative  algorithms,  LARS- 
TD  and  LSTD,  on  a  synthetic  example,  and  argue  that  PSTD  is  most  effective  when  approxi¬ 
mating  a  value  function  from  a  large  number  of  features,  each  of  which  contains  at  least  a  little 
information  about  state.  Finally,  we  apply  PSTD  to  a  difficult  optimal  stopping  problem,  and 
demonstrate  the  practical  utility  of  the  algorithm  by  outperforming  several  alternative  approaches 
and  topping  the  best  reported  previous  results. 

Chapter 10

A  Spectral  Learning  Approach  to 
Range-Only  SLAM 


This  chapter  represents  a  significant  shift  in  focus  from  the  previous  chapters.  The  bulk  of  this 
thesis  has  focused  on  predictive  representations  and  spectral  learning  algorithms  for  learning  the 
parameters  of  dynamical  system  models.  This  chapter  uses  similar  tools  but  focuses  on  a  specific 
applied  problem:  range-only  simultaneous  localization  and  mapping  (range-only  SLAM)  with 
known  correspondences  and  limited  amounts  of  missing  data  [10]. 


10.1  Introduction 

In  range-only  SLAM,  we  are  given  a  sequence  of  range  measurements  from  a  robot  to  fixed 
landmarks,  and  possibly  a  matching  sequence  of  odometry  measurements.  We  then  attempt  to 
simultaneously estimate the robot’s trajectory and the locations of the landmarks. Popular approaches
to range-only SLAM include EKFs and EIFs [27, 28, 50, 56, 111], multiple-hypothesis
trackers  (including  particle  filters  and  multiple  EKFs/EIFs)  [29,  111],  and  batch  optimization  of 
a  likelihood  function  [53]. 

In all the above approaches, the most popular representation for a hypothesis is a list of landmark locations $(m_{n,x}, m_{n,y})$ and a list of robot poses $(x_t, y_t, \theta_t)$. Unfortunately, both the motion
and  measurement  models  are  highly  nonlinear  in  this  representation,  leading  to  computational 
problems:  inaccurate  linearizations  in  EKF/EIF/MHT  and  local  optima  in  batch  optimization  ap¬ 
proaches  (see  Section  10.2  for  details).  Much  work  has  attempted  to  remedy  this  problem,  e.g., 
by  changing  the  hypothesis  representation  [27]  or  by  keeping  multiple  hypotheses  [27,  29,  111]. 
While  considerable  progress  has  been  made,  none  of  these  methods  are  ideal;  common  dif¬ 
ficulties  include  the  need  for  an  extensive  initialization  phase,  inability  to  recover  from  poor 
initialization,  lack  of  performance  guarantees,  or  excessive  computational  requirements. 

We  take  a  very  different  approach:  we  formulate  range-only  SLAM  as  a  matrix  factorization 
problem,  where  features  of  observations  are  linearly  related  to  a  4-  or  7-dimensional  state  space. 
This  approach  has  several  desirable  properties.  First,  we  need  weaker  assumptions  about  the 
measurement  model  and  motion  model  than  previous  approaches  to  range-only  SLAM.  Second, 
our state space yields a linear measurement model, so we hope to lose less information during tracking to approximation errors and local optima. Third, our formulation leads to a simple spectral learning algorithm, based on a fast and robust singular value decomposition (SVD); in fact, our algorithm is an instance of a general spectral system identification framework, from which it inherits desirable guarantees including statistical consistency and no local optima. Fourth, we don't need to worry as much as previous methods about errors such as a consistent bias in odometry, or a receiver mounted at a different height from the transmitters: in general, we can learn to correct such errors automatically by expanding the dimensionality of our state space.

As  we  will  discuss  in  Section  10.2,  our  approach  to  range-only  SLAM  has  much  in  common 
with  spectral  algorithms  for  subspace  identification  [14,  117];  unlike  these  methods,  our  focus  on 
SLAM  makes  it  easy  to  interpret  our  state  space.  Our  approach  is  also  related  to  factorization- 
based  structure  from  motion  [49,  113,  114],  as  well  as  to  recent  dimensionality-reduction-based 
methods  for  localization  and  mapping  [8,  32,  86,  124]. 

We  begin  in  Section  10.2  by  reviewing  background  related  to  our  approach.  In  Section  10.3 
we  present  the  basic  spectral  learning  algorithm  for  range-only  SLAM,  and  discuss  how  it  relates 
to  state  space  discovery  for  a  dynamical  system.  We  conclude  in  Section  10.4  by  comparing  spec¬ 
tral  SLAM  to  other  popular  methods  for  range-only  SLAM  on  real  world  range  data  collected 
from  an  autonomous  lawnmower  with  time-of-flight  ranging  radios. 


10.2  Background 

There  are  four  main  pieces  of  relevant  background:  first,  the  well-known  solutions  to  range-only 
SLAM  using  variations  of  the  extended  Kalman  filter  and  batch  optimization;  second,  recently- 
discovered  spectral  approaches  to  identifying  parameters  of  nonlinear  dynamical  systems;  third, 
matrix  factorization  for  finding  structure  from  motion  in  video;  and  fourth,  dimensionality- 
reduction  methods  for  localization  and  mapping.  Below,  we  will  discuss  the  connections  among 
these  areas,  and  show  how  they  can  be  unified  within  a  spectral  learning  framework. 


10.2.1  Likelihood-based  Range-only  SLAM 

The standard probabilistic model for range-only localization [50, 56] represents robot state by a vector $s_t = [x_t, y_t, \theta_t]^\top$; the robot's (nonlinear) motion and observation models are

$$s_{t+1} = \begin{bmatrix} x_t + v_t \cos(\theta_t) \\ y_t + v_t \sin(\theta_t) \\ \theta_t + \omega_t \end{bmatrix} + \epsilon_t, \qquad d_{t,n} = \sqrt{(m_{n,x} - x_t)^2 + (m_{n,y} - y_t)^2} + \eta_t \tag{10.1}$$


Here $v_t$ is the distance traveled, $\omega_t$ is the orientation change, $d_{t,n}$ is the estimate of the range from the $n$th landmark location $(m_{n,x}, m_{n,y})$ to the current location of the robot $(x_t, y_t)$, and $\epsilon_t$ and $\eta_t$ are noise. Throughout this chapter we assume known correspondences, since range
sensing  systems  such  as  radio  beacons  typically  associate  unique  identifiers  with  each  reading. 
In the case of unknown correspondences or large amounts of missing data, our method becomes a solution to a subproblem, which helps since being able to solve the subproblem reliably makes the overall search easier.

To handle SLAM rather than just localization, we can extend the state to include landmark positions:

$$s_t = [x_t, y_t, \theta_t, m_{1,x}, m_{1,y}, \ldots, m_{N,x}, m_{N,y}]^\top \tag{10.2}$$

where  N  is  the  number  of  landmarks.  The  motion  and  measurement  models  remain  the  same. 
Given  this  model,  we  can  use  any  standard  optimization  algorithm  (such  as  Gauss-Newton)  to 
fit  the  unknown  robot  and  landmark  parameters  by  maximum  likelihood.  Or,  we  can  track  these 
parameters  online  using  EKFs,  EIFs,  or  MHT  methods  like  particle  filters. 
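
As a concrete illustration of Eq. 10.1, the following minimal Python sketch implements the noise-free motion update and the range measurement to a single landmark. The function names `step` and `measure_range` and the example values are hypothetical, chosen only for illustration; motion noise is omitted and measurement noise is optional.

```python
import math
import random


def step(state, v, omega):
    """Noise-free motion model: move distance v along the heading, then turn by omega."""
    x, y, theta = state
    return (x + v * math.cos(theta), y + v * math.sin(theta), theta + omega)


def measure_range(state, landmark, noise_std=0.0, rng=random):
    """Range from the robot to a landmark (m_x, m_y), plus optional Gaussian noise."""
    x, y, _ = state
    m_x, m_y = landmark
    noise = rng.gauss(0.0, noise_std) if noise_std > 0 else 0.0
    return math.hypot(m_x - x, m_y - y) + noise


# Hypothetical two-step trajectory starting at the origin, facing along +x.
state = (0.0, 0.0, 0.0)
state = step(state, 1.0, math.pi / 2)  # drive 1 unit along +x, then turn left
state = step(state, 2.0, 0.0)          # drive 2 units along +y
d = measure_range(state, (4.0, 6.0))   # noise-free range to a landmark at (4, 6)
```

Both functions are nonlinear in the pose $(x_t, y_t, \theta_t)$, which is the source of the linearization difficulties discussed next.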

EKFs and EIFs are a popular solution for localization and mapping problems: for each new odometry input $a_t = [v_t, \omega_t]^\top$ and each new measurement $d_t$, we propagate the estimate of the robot state and error covariance by linearizing the non-linear motion and measurement models.
Unfortunately,  though,  range-only  SLAM  is  notoriously  difficult  for  EKFs/EIFs:  since  range- 
only  sensors  are  not  informative  enough  to  completely  localize  a  robot  or  a  landmark  from  a 
small  number  of  readings,  nonlinearities  are  much  worse  in  range-only  SLAM  than  they  are  in 
other  applications  such  as  range-and-bearing  SLAM.  In  particular,  if  we  don’t  have  a  sharp  prior 
distribution  for  landmark  positions,  then  after  a  few  steps,  the  exact  posterior  becomes  highly 
non-Gaussian  and  multimodal;  so,  any  Gaussian  approximation  to  the  posterior  is  necessarily 
inaccurate.  Furthermore,  an  EKF  will  generally  not  even  produce  the  best  possible  Gaussian 
approximation:  a  good  linearization  would  tell  us  a  lot  about  the  modes  of  the  posterior,  which 
would  be  equivalent  to  solving  the  original  SLAM  problem.  So,  practical  applications  of  the  EKF 
to  range-only  SLAM  attempt  to  delay  linearization  until  enough  information  is  available,  e.g., 
via  an  extended  initialization  phase  for  each  landmark.  Such  delays  simply  push  the  problem  of 
finding  a  good  hypothesis  onto  the  initialization  algorithm. 

Djugash  et  al.  proposed  a  polar  parameterization  to  more  accurately  represent  the  annular  and 
multimodal  distributions  typically  encountered  in  range-only  SLAM.  The  resulting  approach  is 
called  the  ROP-EKF,  and  is  shown  to  outperform  the  ordinary  (Cartesian)  EKF  in  several  real- 
world  problems,  especially  in  combination  with  multiple-hypothesis  tracking  [27,  28].  However, 
the  multi-hypothesis  ROP-EKF  can  be  much  more  expensive  than  an  EKF,  and  is  still  a  heuristic 
approximation  to  the  true  posterior. 

Instead  of  the  posterior  covariance  of  the  state  (as  used  by  the  EKF),  the  extended  informa¬ 
tion  filter  (EIF)  maintains  an  estimate  of  the  inverse  covariance.  The  two  representations  are 
statistically  equivalent  (and  therefore  have  the  same  failure  modes).  But,  the  inverse  covariance 
is  often  approximately  sparse,  leading  to  much  more  efficient  approximate  computation  [111]. 

10.2.2  Spectral  State  Space  Discovery  and  System  Identification 

System identification algorithms attempt to learn dynamical system parameters such as a state space, a dynamics model (motion model), and an observation model (measurement model) directly from samples of observations and actions. In the last few years, spectral system identification algorithms have become popular; these algorithms learn a state space via a spectral decomposition of a carefully designed matrix of observable features, then find transition and observation models by linear regressions involving the learned states. Originally, subspace identification algorithms were almost exclusively used for linear system identification [117], but recently, similar spectral algorithms have been used to learn models of partially observable nonlinear dynamical systems such as HMMs [42, 90] and PSRs [12, 14, 15, 84]. All of these spectral algorithms share a strategy for state space discovery: they learn a state space via a spectral decomposition of a matrix of observations (Figure 10.1), resulting in a linear observation function, and then they learn a model of the dynamics in the learned low-dimensional state space. This is a powerful and appealing approach: the resulting algorithms are statistically consistent, and they are easy to implement with efficient linear algebra operations. In contrast, batch optimization of likelihood (e.g., via the popular expectation maximization (EM) algorithm) is only known to be consistent if we find the global optimum of the likelihood function, typically an impractical requirement.

Figure 10.1: A general principle for state space discovery. We can think of state as a statistic of history that is minimally sufficient to predict future observations. If the bottleneck is a rank constraint, then we get a spectral method.

As  we  will  see  in  Section  10.3,  we  can  view  the  range-only  SLAM  problem  as  an  instance 
of  spectral  state  space  discovery.  And,  the  Appendix  (Sec.  12.2.3)  discusses  how  to  identify 
transition  and  measurement  models  given  the  learned  states.  The  same  properties  that  make 
spectral  methods  appealing  for  system  identification  carry  over  to  our  spectral  SLAM  algorithm: 
computational  efficiency,  statistical  consistency,  and  finite-sample  error  bounds. 

10.2.3  Orthographic  Structure  From  Motion 

In  some  ways  the  orthographic  structure  from  motion  (SfM)  problem  in  vision  [113]  is  very 
similar  to  the  SLAM  problem:  the  goal  is  to  recover  scene  geometry  and  camera  rotations  from 
a  sequence  of  images  (compare  with  landmark  geometry  and  robot  poses  from  a  sequence  of 
range  observations).  And  in  fact,  one  popular  solution  for  SfM  is  very  similar  to  the  state  space 
discovery step in spectral state space identification. The key idea in spectral SfM is that an image sequence can be represented as a $2F \times P$ measurement matrix $W$, containing the horizontal and vertical coordinates of $P$ points tracked through $F$ frames. If the images are the result of an orthographic camera projection, then it is possible to show that $\mathrm{rank}(W) = 3$. As a consequence, the measurement matrix can be factored into the product of two matrices $U$ and $V$, where $U$ contains the 3D positions of the features and $V$ contains the camera axis rotations [113]. With respect to system identification, it is possible to interpret the matrix $U$ as an observation model and $V$ as an estimate of the system state. Inspired by SfM, we reformulate the range-only SLAM problem in a similar way in Section 10.3, and then similarly solve the problem with a spectral
learning  algorithm.  Also  similar  to  SfM,  we  examine  the  identifiability  of  our  factorization,  and 
give  a  metric  upgrade  procedure  which  extracts  additional  geometric  information  beyond  what 
the  factorization  gives  us. 

10.2.4  Dimensionality-reduction-based  Methods  for  Mapping 

Dimensionality  reduction  methods  have  recently  provided  an  alternative  to  more  traditional 
likelihood-based  methods  for  mapping.  In  particular,  the  problem  of  finding  a  good  map  can 
be  viewed  as  finding  a  (possibly  nonlinear)  embedding  of  sensor  data  via  methods  like  multidi¬ 
mensional  scaling  (MDS)  and  manifold  learning. 

For  example,  MDS  has  been  used  to  determine  a  Euclidean  map  of  sensor  locations  where 
there  is  no  distinction  between  landmark  positions  and  robot  positions  [86]:  instead  all-to-all 
range  measurements  are  assumed  for  a  set  of  landmarks.  If  some  pairwise  measurements  are 
not  available,  these  measurements  can  be  approximated  by  some  interpolation  method,  e.g.  the 
geodesic  distance  between  the  landmarks  [86,  108]. 

Our  problem  differs  from  this  previous  work:  in  contrast  to  MDS,  we  have  no  landmark- 
to-landmark  measurements  and  only  inaccurate  robot-to-robot  measurements  (from  odometry, 
which  may  not  be  present,  and  which  often  has  significant  errors  when  integrated  over  more  than 
a  short  distance).  Additionally,  our  smaller  set  of  measurements  introduces  additional  challenges 
not  present  in  classical  MDS:  linear  methods  can  recover  the  positions  only  up  to  a  linear  trans¬ 
formation.  This  ambiguity  forces  changes  compared  to  the  MDS  algorithm:  while  MDS  factors 
the  all-to-all  matrix  of  squared  ranges,  in  Sec.  10.3.1  we  factor  only  a  block  of  this  matrix,  then 
use  either  a  metric  upgrade  step  or  a  few  global  position  measurements  to  resolve  the  ambiguity. 

A popular alternative to linear dimensionality reduction techniques like classical MDS is manifold learning: nonlinearly mapping sensor inputs to a feature space that “unfolds” the manifold on which the data lies and then applying dimensionality reduction. Such nonlinear dimensionality reduction has been used to learn maps of wi-fi networks and landmark locations when sensory data is thought to be nonlinearly related to the underlying Euclidean space in which the landmarks lie [8, 32, 124]. Unlike these approaches, we show that linear dimensionality reduction is sufficient to solve the range-only SLAM problem. (In particular, [124] suggests solving
range-only  mapping  using  nonlinear  dimensionality  reduction.  We  not  only  show  that  this  is  un¬ 
necessary,  but  additionally  show  that  linear  dimensionality  reduction  is  sufficient  for  localization 
as  well.)  This  greatly  simplifies  the  learning  algorithm  and  allows  us  to  provide  strong  statistical 
guarantees  for  the  mapping  portion  of  SLAM  (Sec.  10.3.3). 


10.3  State  Space  Discovery  and  Spectral  SLAM 

We start with SLAM from range data without odometry. For now, we assume no noise, no missing data, and batch processing. We will generalize below: Sec. 10.3.2 discusses how to recover robot orientation, Sec. 10.3.3 discusses noise, and Sec. 10.3.4 discusses missing data and online SLAM. In the Appendix (Section 12.2.3) we discuss learning motion and measurement models.


10.3.1  Range-only  SLAM  as  Matrix  Factorization 

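Consider the matrix $Y \in \mathbb{R}^{N \times T}$ of squared ranges, with $N \ge 4$ landmarks and $T \ge 4$ time steps:

$$Y = \frac{1}{2} \begin{bmatrix} d_{1,1}^2 & d_{1,2}^2 & \cdots & d_{1,T}^2 \\ d_{2,1}^2 & d_{2,2}^2 & \cdots & d_{2,T}^2 \\ \vdots & \vdots & & \vdots \\ d_{N,1}^2 & d_{N,2}^2 & \cdots & d_{N,T}^2 \end{bmatrix} \tag{10.3}$$

where $d_{n,t}$ is the measured distance from the robot to landmark $n$ at time step $t$.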

The most basic version of our spectral SLAM method relies on the insight that $Y$ factors according to robot position $(x_t, y_t)$ and landmark position $(m_{n,x}, m_{n,y})$. To see why, note

$$d_{n,t}^2 = (m_{n,x}^2 + m_{n,y}^2) - 2 m_{n,x} \cdot x_t - 2 m_{n,y} \cdot y_t + (x_t^2 + y_t^2) \tag{10.4}$$

If we write $C_n = [(m_{n,x}^2 + m_{n,y}^2)/2,\ m_{n,x},\ m_{n,y},\ 1]^\top$ and $X_t = [1,\ -x_t,\ -y_t,\ (x_t^2 + y_t^2)/2]^\top$, it is easy to see that $d_{n,t}^2 = 2 C_n^\top X_t$. So, $Y$ factors as $Y = CX$, where $C \in \mathbb{R}^{N \times 4}$ contains the positions of landmarks,


$$C = \begin{bmatrix} (m_{1,x}^2 + m_{1,y}^2)/2 & m_{1,x} & m_{1,y} & 1 \\ \vdots & \vdots & \vdots & \vdots \\ (m_{N,x}^2 + m_{N,y}^2)/2 & m_{N,x} & m_{N,y} & 1 \end{bmatrix} \tag{10.5}$$

and $X \in \mathbb{R}^{4 \times T}$ contains the positions of the robot over time


$$X = \begin{bmatrix} 1 & \cdots & 1 \\ -x_1 & \cdots & -x_T \\ -y_1 & \cdots & -y_T \\ (x_1^2 + y_1^2)/2 & \cdots & (x_T^2 + y_T^2)/2 \end{bmatrix} \tag{10.6}$$


If we can recover $C$ and $X$, we can read off the solution to the SLAM problem. The fact that $Y$'s rank is at most 4 suggests that we might be able to use a rank-revealing factorization of $Y$, such as the singular value decomposition, to find $C$ and $X$. Unfortunately, such a factorization only determines $C$ and $X$ up to a linear transform: given an invertible matrix $S$, we can write $Y = CX = C S^{-1} S X$. Therefore, factorization can only hope to recover $U = C S^{-1}$ and $V = SX$.

To  upgrade  the  factors  U  and  V  to  a  full  metric  map,  we  have  two  options.  If  global  posi¬ 
tion  estimates  are  available  for  at  least  four  landmarks,  we  can  learn  the  transform  S  via  linear 
regression,  and  so  recover  the  original  C  and  X.  This  method  works  as  long  as  we  know  at  least 
four landmark positions. Figure 10.2A shows a simulated example.

Figure 10.2: Spectral SLAM on simulated data. See Section 10.4.1 for details. A.) Randomly generated landmarks (6 of them) and robot path through the environment (500 timesteps). An SVD of the squared distance matrix recovers a linear transform of the landmark and robot positions. Given the coordinates of 4 landmarks, we can recover the landmark and robot positions in their original coordinates; or, since 500 > 9, we can recover positions up to an orthogonal transform with no additional information. Despite noisy observations, the robot recovers the true path and landmark positions with very high accuracy. B.) The convergence of the observation model $\hat{C}_{5:6}$ for the remaining two landmarks: mean Frobenius-norm error vs. number of range readings received, averaged over 1000 randomly generated pairs of robot paths and environments. Error bars indicate 95% confidence intervals.
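
The factorization and the regression-based upgrade described above can be sketched in a few lines of NumPy. This is a minimal, noise-free illustration rather than the implementation used in our experiments: the landmark coordinates and the spiral path are hypothetical, and `Y` stores half the squared ranges so that $Y = CX$ holds exactly, in line with $d_{n,t}^2 = 2 C_n^\top X_t$.

```python
import numpy as np

# Hypothetical map and path (illustrative values only).
landmarks = np.array([[-4.0, -4.0], [4.0, -3.0], [-3.0, 4.0],
                      [5.0, 5.0], [0.0, 1.0], [2.0, -5.0]])      # rows are (m_x, m_y)
t = np.linspace(0.0, 4.0 * np.pi, 50)
path = 0.5 * np.column_stack([t * np.cos(t), t * np.sin(t)])     # rows are (x_t, y_t)
N, T = len(landmarks), len(path)

# Half squared ranges between every landmark and every robot position.
diff = landmarks[:, None, :] - path[None, :, :]
Y = 0.5 * np.sum(diff ** 2, axis=2)

# Ground-truth factors C (N x 4) and X (4 x T), used here only for checking.
C = np.column_stack([0.5 * np.sum(landmarks ** 2, axis=1), landmarks, np.ones(N)])
X = np.vstack([np.ones(T), -path[:, 0], -path[:, 1], 0.5 * np.sum(path ** 2, axis=1)])

# Rank-4 SVD: U = C S^{-1} and V = S X for some unknown invertible S.
u, s, vt = np.linalg.svd(Y, full_matrices=False)
U, V = u[:, :4], s[:4, None] * vt[:4]

# Regression upgrade: solve U[:4] S = C[:4] using four landmarks with known positions.
S = np.linalg.lstsq(U[:4], C[:4], rcond=None)[0]
C_hat = U @ S                    # recovered map, including the two unknown landmarks
X_hat = np.linalg.solve(S, V)    # recovered robot states; rows 1 and 2 hold -x_t and -y_t
```

The four landmarks used as regression targets must be in non-singular position (for instance, not all on one circle or line); otherwise the corresponding $4 \times 4$ block of $C$ is not invertible and $S$ is not determined.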


On  the  other  hand,  if  no  global  positions  are  known,  the  best  we  can  hope  to  do  is  recover 
landmark  and  robot  positions  up  to  an  orthogonal  transform  (translation,  rotation,  and  reflection). 
It  turns  out  that  Eqs.  (10.5-10.6)  provide  enough  additional  geometric  constraints  to  do  so:  in  the 
Appendix  (Sec.  12.2.1)  we  show  that,  if  we  have  at  least  9  time  steps  and  at  least  9  landmarks, 
and  if  each  of  these  point  sets  is  non-singular  in  an  appropriate  sense,  then  we  can  compute 
the  metric  upgrade  in  closed  form.  The  idea  is  to  fit  a  quadratic  surface  to  the  rows  of  U,  then 
change  coordinates  so  that  the  surface  becomes  the  function  in  (10.5).  (By  contrast,  the  usual 
metric  upgrade  for  orthographic  structure  from  motion  [113],  which  uses  the  constraint  that 
camera  projection  matrices  are  orthogonal,  requires  a  nonlinear  optimization.) 


10.3.2  SLAM  with  Headings 

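In addition to location, we often want the robot's global heading $\theta$. We could get headings by post-processing our learned positions, but in practice we can reduce variance by learning positions and headings simultaneously. We do so by adding more features to our measurement matrix: differences between successive pairs of squared distances, scaled by velocity (which we can estimate from odometry). Since we need pairs of time steps, we now have $Y \in \mathbb{R}^{2N \times (T-1)}$, whose top $N$ rows hold the halved squared ranges and whose bottom $N$ rows hold the new features:

$$Y = \begin{bmatrix} \left[\, d_{n,t}^2 / 2 \,\right]_{n,t} \\ \left[\, (d_{n,t+1}^2 - d_{n,t}^2) / (2 v_t) \,\right]_{n,t} \end{bmatrix}, \qquad n = 1, \ldots, N, \quad t = 1, \ldots, T-1 \tag{10.7}$$

As before, we can factor $Y$ into a robot state matrix and a landmark matrix. The key new observation is that we can write the new features in terms of $\cos(\theta)$ and $\sin(\theta)$:

$$\frac{d_{n,t+1}^2 - d_{n,t}^2}{2 v_t} = -\frac{m_{n,x}(x_{t+1} - x_t)}{v_t} - \frac{m_{n,y}(y_{t+1} - y_t)}{v_t} + \frac{x_{t+1}^2 - x_t^2 + y_{t+1}^2 - y_t^2}{2 v_t}$$

$$= -m_{n,x} \cos(\theta_t) - m_{n,y} \sin(\theta_t) + \frac{x_{t+1}^2 - x_t^2 + y_{t+1}^2 - y_t^2}{2 v_t} \tag{10.8}$$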

From Eq. 10.4 and Eq. 10.8 it is easy to see that $Y$ has rank at most 7 (exactly 7 if the robot path and landmark positions are not singular): we have $Y = CX$, where $C \in \mathbb{R}^{2N \times 7}$ contains functions of landmark positions and $X \in \mathbb{R}^{7 \times (T-1)}$ contains functions of robot state.


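$$C = \begin{bmatrix} (m_{1,x}^2 + m_{1,y}^2)/2 & m_{1,x} & m_{1,y} & 1 & 0 & 0 & 0 \\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\ (m_{N,x}^2 + m_{N,y}^2)/2 & m_{N,x} & m_{N,y} & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & m_{1,x} & m_{1,y} & 1 \\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\ 0 & 0 & 0 & 0 & m_{N,x} & m_{N,y} & 1 \end{bmatrix} \tag{10.9}$$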


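$$X = \begin{bmatrix} 1 & \cdots & 1 \\ -x_1 & \cdots & -x_{T-1} \\ -y_1 & \cdots & -y_{T-1} \\ (x_1^2 + y_1^2)/2 & \cdots & (x_{T-1}^2 + y_{T-1}^2)/2 \\ -\cos(\theta_1) & \cdots & -\cos(\theta_{T-1}) \\ -\sin(\theta_1) & \cdots & -\sin(\theta_{T-1}) \\ \frac{x_2^2 - x_1^2 + y_2^2 - y_1^2}{2 v_1} & \cdots & \frac{x_T^2 - x_{T-1}^2 + y_T^2 - y_{T-1}^2}{2 v_{T-1}} \end{bmatrix} \tag{10.10}$$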
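
The identity behind the heading features (Eq. 10.8) can be checked numerically for a single landmark and a single motion step. The sketch below is illustrative only: the function names and numeric values are hypothetical, and the robot is assumed to follow the noise-free motion model, so that $(x_{t+1} - x_t)/v_t = \cos(\theta_t)$ and $(y_{t+1} - y_t)/v_t = \sin(\theta_t)$.

```python
import math


def heading_feature(m, p_t, p_next, v):
    """Velocity-scaled half-difference of successive squared ranges."""
    d2_t = (m[0] - p_t[0]) ** 2 + (m[1] - p_t[1]) ** 2
    d2_next = (m[0] - p_next[0]) ** 2 + (m[1] - p_next[1]) ** 2
    return (d2_next - d2_t) / (2.0 * v)


def heading_feature_factored(m, p_t, p_next, v, theta):
    """The same quantity as an inner product of landmark terms and robot-state terms."""
    c_row = (m[0], m[1], 1.0)
    x_col = (
        -math.cos(theta),
        -math.sin(theta),
        (p_next[0] ** 2 - p_t[0] ** 2 + p_next[1] ** 2 - p_t[1] ** 2) / (2.0 * v),
    )
    return sum(c * x for c, x in zip(c_row, x_col))


# Hypothetical heading, step length, robot position, and landmark.
theta, v = 0.7, 1.5
p_t = (2.0, -1.0)
p_next = (p_t[0] + v * math.cos(theta), p_t[1] + v * math.sin(theta))
m = (5.0, 3.0)

lhs = heading_feature(m, p_t, p_next, v)
rhs = heading_feature_factored(m, p_t, p_next, v, theta)
```

The two values agree up to floating-point error, which is why the new rows of $Y$ add exactly three dimensions (for $-\cos\theta_t$, $-\sin\theta_t$, and the quadratic term) to the four of Section 10.3.1.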


As  with  the  basic  SLAM  algorithm  in  Section  10.3.1,  we  can  factor  Y  using  SVD,  this  time 
keeping  7  singular  values.  To  make  the  state  space  interpretable,  we  can  then  look  at  the  top 
part  of  the  learned  transform  of  C:  as  long  as  we  have  at  least  four  landmarks  in  non-singular 
position,  this  block  will  have  exactly  a  three-dimensional  nullspace  (due  to  the  three  columns  of 
zeros  in  the  top  part  of  C ).  After  eliminating  this  nullspace,  we  can  proceed  as  before  to  learn 
S  and  make  the  state  space  interpretable:  either  use  the  coordinates  of  at  least  4  landmarks  as 
regression  targets,  or  perform  a  metric  upgrade.  (See  the  Appendix,  Sec.  12.2.1,  for  details). 
Once we have positions, we can recover headings as angles between successive positions.

Algorithm 1 Spectral SLAM
In: i.i.d. pairs of observations $\{o_t, a_t\}_{t=1}^{T}$; optional: measurement model for $\ge 4$ landmarks $C_{1:4}$
Out: measurement model (map) $\hat{C}$, robot locations $\hat{X}$ (the $t$th column is location at time $t$)
1: Collect observations and odometry into a matrix $\hat{Y}$ (Eq. 10.7)
2: Find the top 7 singular values and vectors: $\langle \hat{U}, \hat{\Lambda}, \hat{V}^\top \rangle \leftarrow \mathrm{SVD}(\hat{Y}, 7)$
   The transformed measurement matrix is $\hat{C} S^{-1} = \hat{U}$ and robot states are $S \hat{X} = \hat{\Lambda} \hat{V}^\top$.
3: Find $\hat{S}$ via linear regression (from $\hat{U}$ to $C_{1:4}$) or metric upgrade (see Appendix) and return $\hat{C} = \hat{U} \hat{S}$ and $\hat{X} = \hat{S}^{-1} \hat{\Lambda} \hat{V}^\top$


10.3.3  A  Spectral  SLAM  Algorithm 

The matrix factorizations of Secs. 10.3.1 and 10.3.2 suggest a straightforward SLAM algorithm, Alg. 1: build an empirical estimate $\hat{Y}$ of $Y$ by sampling observations as the robot traverses its environment, then apply a rank-7 thin SVD, discarding the remaining singular values to suppress noise.

$$\langle \hat{U}, \hat{\Lambda}, \hat{V}^\top \rangle \leftarrow \mathrm{SVD}(\hat{Y}, 7) \tag{10.11}$$

Following Section 10.3.2, the left singular vectors $\hat{U}$ are an estimate of our transformed measurement matrix $C S^{-1}$, and the weighted right singular vectors $\hat{\Lambda} \hat{V}^\top$ are an estimate of our transformed robot state $SX$. We can then learn $S$ via regression or metric upgrade.


Statistical Consistency and Sample Complexity Let $M \in \mathbb{R}^{N \times N}$ be the true observation covariance for a randomly sampled robot position, and let $\hat{M} = \frac{1}{T} \hat{Y} \hat{Y}^\top$ be the empirical covariance estimated from $T$ observations. Then the true and estimated measurement models are the top singular vectors of $M$ and $\hat{M}$. Assuming that the noise in $\hat{M}$ is zero-mean, as we include more data in our averages, we will show below that the law of large numbers guarantees that $\hat{M}$ converges to the true covariance $M$. So, our learning algorithm is consistent for estimating the range of $M$, i.e., the landmark locations. (The estimated robot positions will typically not converge, since we typically have a bounded effective number of observations relevant to each robot position. But, as we see each landmark again and again, the robot position errors will average out, and we will recover the true map.)

In  more  detail,  we  can  give  finite-sample  bounds  on  the  error  in  recovering  the  true  factors. 
For  simplicity  of  presentation  we  assume  that  noise  is  i.i.d.,  although  our  algorithm  will  work  for 
any  zero-mean  noise  process  with  a  finite  mixing  time.  (The  error  bounds  will  of  course  become 
weaker  in  proportion  to  mixing  time,  since  we  gain  less  new  information  per  observation.)  The 
argument  (see  the  Appendix,  Sec.  12.2.2,  for  details)  has  two  pieces:  standard  concentration 
bounds  show  that  each  element  of  our  estimated  covariance  approaches  its  population  value; 
then the continuity of the SVD shows that the learned subspace also approaches its true value. The final bound is:

$$\|\sin \Psi\|_2 \le \frac{N c \, \cdots}{\gamma} \tag{10.12}$$

where $\Psi$ is the vector of canonical angles between the learned subspace and the true one, $c$ is a constant depending on our error distribution, and $\gamma$ is the true smallest nonzero eigenvalue of the covariance. In particular, this bound means that the sample complexity is $O(\zeta^2)$ to achieve error $\zeta$.
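
The consistency argument can be illustrated with a small simulation that measures the largest sine of the canonical angles between the learned subspace and the true range of $C$. This is a sketch under simplifying assumptions that are ours, not part of the analysis above: robot positions are drawn i.i.d. from a box, zero-mean Gaussian noise is added directly to the entries of the matrix of halved squared ranges, only the 4-dimensional range-only model of Section 10.3.1 is used, and the landmark coordinates and the helper names are hypothetical.

```python
import numpy as np


def half_squared_ranges(landmarks, positions):
    """N x T matrix of d_{n,t}^2 / 2 for landmarks (N x 2) and positions (T x 2)."""
    diff = landmarks[:, None, :] - positions[None, :, :]
    return 0.5 * np.sum(diff ** 2, axis=2)


def subspace_error(landmarks, T, noise_std, rng):
    """Largest sine of the canonical angles between range(C) and the learned subspace."""
    N = len(landmarks)
    C = np.column_stack([0.5 * np.sum(landmarks ** 2, axis=1), landmarks, np.ones(N)])
    Q_true, _ = np.linalg.qr(C)                          # orthonormal basis of range(C)
    positions = rng.uniform(-5.0, 5.0, size=(T, 2))      # i.i.d. robot positions
    noise = rng.normal(0.0, noise_std, size=(N, T))      # zero-mean i.i.d. noise
    Y_hat = half_squared_ranges(landmarks, positions) + noise
    U_hat = np.linalg.svd(Y_hat, full_matrices=False)[0][:, :4]
    residual = U_hat - Q_true @ (Q_true.T @ U_hat)       # part of U_hat outside range(C)
    return np.linalg.norm(residual, 2)


landmarks = np.array([[-4.0, -4.0], [4.0, -3.0], [-3.0, 4.0],
                      [5.0, 5.0], [0.0, 1.0], [2.0, -5.0]])
rng = np.random.default_rng(0)
err_small = subspace_error(landmarks, T=100, noise_std=0.1, rng=rng)
err_large = subspace_error(landmarks, T=10000, noise_std=0.1, rng=rng)
```

Under these assumptions the subspace error is expected to shrink as $T$ grows, which is the qualitative behavior the bound describes; the simulation is not a substitute for the proof in the Appendix.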

10.3.4  Extensions:  Missing  Data,  Online  SLAM,  and  System  ID 

Missing  data  So  far  we  have  assumed  that  we  receive  range  readings  to  all  landmarks  at  each 
time  step.  In  practice  this  assumption  is  rarely  satisfied:  we  may  receive  range  readings  asyn¬ 
chronously,  some  range  readings  may  be  missing  entirely,  and  it  is  often  the  case  that  odometry 
data  is  sampled  faster  than  range  readings.  Here  we  outline  two  methods  for  overcoming  this 
practical  difficulty. 

First,  if  a  relatively  small  number  of  observations  are  missing,  we  can  use  standard  ap¬ 
proaches  for  factorization  with  missing  data.  For  example,  probabilistic  PCA  [112]  estimates 
the  missing  entries  via  an  EM  algorithm,  and  matrix  completion  [21]  uses  a  trace-norm  penalty 
to  recover  a  low-rank  factorization  with  high  probability.  However,  for  range-only  data,  often  the 
fraction  of  missing  data  is  high  and  the  missing  values  are  structural  rather  than  random. 

The  second  approach  is  interpolation:  we  divide  the  data  into  overlapping  subsets  and  then 
use  local  odometry  information  to  interpolate  the  range  data  within  each  subset.  To  interpolate 
the  data,  we  estimate  a  robot  path  by  dead  reckoning.  For  each  point  in  the  dead  reckoning 
path we build the feature representation $[1, -x, -y, (x^2 + y^2)/2]^\top$. We then learn a linear model
that  predicts  a  squared  range  reading  from  these  features  (for  the  data  points  where  range  is 
available),  as  in  Eq.  10.4.  Next  we  predict  the  squared  range  along  the  entire  path.  Finally 
we  build  the  matrix  Y  by  averaging  the  locally  interpolated  range  readings.  This  interpolation 
approach  works  much  better  in  practice  than  the  fully  probabilistic  approaches  mentioned  above, 
and  was  used  in  our  experiments  in  Section  10.4. 
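
The interpolation step for a single landmark within one subset can be sketched as follows. This is a minimal illustration, not the code used in our experiments: the function name `interpolate_squared_ranges`, the parabolic path, and the landmark position are hypothetical, missing readings are marked with NaN, and the dead-reckoned path is taken to be exact so that the fit is perfect.

```python
import numpy as np


def interpolate_squared_ranges(path, sq_ranges):
    """Fill NaN entries of sq_ranges (one landmark) by regressing on path features."""
    feats = np.column_stack([
        np.ones(len(path)),
        -path[:, 0],
        -path[:, 1],
        0.5 * np.sum(path ** 2, axis=1),
    ])                                                    # rows are [1, -x, -y, (x^2+y^2)/2]
    seen = ~np.isnan(sq_ranges)
    weights = np.linalg.lstsq(feats[seen], sq_ranges[seen], rcond=None)[0]
    filled = sq_ranges.copy()
    filled[~seen] = feats[~seen] @ weights                # predict only the missing readings
    return filled


# Hypothetical dead-reckoned path and landmark.
t = np.linspace(0.0, 2.0, 21)
path = np.column_stack([3.0 * t, t ** 2])
landmark = np.array([4.0, -2.0])
true_sq = np.sum((path - landmark) ** 2, axis=1)

observed = true_sq.copy()
observed[::3] = np.nan                                    # drop every third reading
filled = interpolate_squared_ranges(path, observed)
```

With real odometry the dead-reckoned path drifts, which is why the regression is fit only over short, overlapping subsets where the local path estimate is still accurate.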

Online  Spectral  SLAM  The  algorithms  developed  in  this  section  so  far  have  had  an  impor¬ 
tant  drawback:  unlike  many  SLAM  algorithms,  they  are  batch  methods  not  online  ones.  The 
extension  to  online  SLAM  is  straightforward:  instead  of  first  estimating  Y  and  then  performing 
an SVD, we sequentially estimate our factors $\langle \hat{U}, \hat{\Lambda}, \hat{V}^\top \rangle$ via online SVD [15, 20].
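
To make the online idea concrete, the sketch below tracks the observation subspace incrementally. It deliberately swaps in a simpler technique than the online SVD of [15, 20]: it maintains a running average of the outer products $y_t y_t^\top$ and recomputes an eigendecomposition on demand. The class name `RunningSubspace`, the landmark coordinates, and the sampled positions are hypothetical, and the data are noise-free.

```python
import numpy as np


class RunningSubspace:
    """Tracks M_hat = (1/T) sum_t y_t y_t^T and exposes its top-k eigenvectors."""

    def __init__(self, dim, k):
        self.M = np.zeros((dim, dim))
        self.count = 0
        self.k = k

    def update(self, y):
        self.count += 1
        self.M += (np.outer(y, y) - self.M) / self.count   # incremental mean

    def basis(self):
        _, vecs = np.linalg.eigh(self.M)                   # eigenvalues in ascending order
        return vecs[:, ::-1][:, :self.k]                   # top-k eigenvectors


landmarks = np.array([[-4.0, -4.0], [4.0, -3.0], [-3.0, 4.0],
                      [5.0, 5.0], [0.0, 1.0], [2.0, -5.0]])
positions = np.random.default_rng(1).uniform(-5.0, 5.0, size=(200, 2))
Y = 0.5 * np.sum((landmarks[:, None, :] - positions[None, :, :]) ** 2, axis=2)

tracker = RunningSubspace(dim=len(landmarks), k=4)
for y in Y.T:                                              # one column of ranges per time step
    tracker.update(y)

B = tracker.basis()
U = np.linalg.svd(Y, full_matrices=False)[0][:, :4]        # batch subspace for comparison
gap = np.linalg.norm(B - U @ (U.T @ B), 2)                 # distance between the two subspaces
```

Because the state kept between updates is only an $N \times N$ matrix, the per-step cost does not grow with the number of time steps; a true online SVD additionally updates the right singular vectors, which this sketch does not attempt.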

Robot  Filtering  and  System  Identification  So  far,  our  algorithms  have  not  directly  used  (or 
needed)  a  robot  motion  model  in  the  learned  state  space.  However,  an  explicit  motion  model 
is  required  if  we  want  to  predict  future  sensor  readings  or  plan  a  course  of  action.  We  have 
two  choices:  we  can  derive  a  motion  model  from  our  learned  transformation  S  between  latent 
states  and  physical  locations,  or  we  can  learn  a  motion  model  directly  from  data  using  spectral 
system  identification.  More  details  about  both  of  these  approaches  can  be  found  in  the  Appendix, 
Sec.  12.2.3. 




10.4  Experimental  Results 


We  perform  several  SLAM  and  robot  navigation  experiments  to  illustrate  and  test  the  ideas  pro¬ 
posed  in  this  chapter.  First  we  show  how  our  methods  work  in  theory  with  synthetic  experiments 
where  complete  observations  are  received  at  each  point  in  time  and  i.i.d.  noise  is  sampled  from 
a  multivariate  Gaussian  distribution.  Next  we  demonstrate  our  algorithm  on  data  collected  from 
a  real-world  robotic  system  with  substantial  amounts  of  missing  data.  Experiments  were  per¬ 
formed  in  Matlab,  on  a  2.66  GHz  Intel  Core  i7  computer  with  8  GB  of  RAM.  In  contrast  to 
batch  nonlinear  optimization  approaches  to  SLAM,  the  spectral  learning  methods  described  in 
this  chapter  are  very  fast,  usually  taking  less  than  a  second  to  run. 

10.4.1  Synthetic  Experiments 

Our  simulator  randomly  places  6  landmarks  in  a  2-D  environment.  A  simulated  robot  then  ran¬ 
domly  moves  through  the  environment  for  500  time  steps  and  receives  a  range  reading  to  each 
one  of  the  landmarks  at  each  time  step.  The  range  readings  are  perturbed  by  noise  sampled  from 
a  Gaussian  distribution  with  variance  equal  to  1%  of  the  range.  Given  this  data,  we  apply  the  al¬ 
gorithm  from  Section  10.3.3  to  solve  the  SLAM  problem.  We  use  the  coordinates  of  4  landmarks 
to learn the linear transform $S$ and recover the true state space, as shown in Figure 10.2A. The
results  indicate  that  we  can  accurately  recover  both  the  landmark  locations  and  the  robot  path. 

We  also  investigated  the  empirical  convergence  rate  of  our  observation  model  (and  therefore 
the map) as the number of range readings increased. To do so, we generated 1000 different random pairs of environments and robot paths. For each pair, we repeatedly performed our spectral SLAM algorithm on increasingly large numbers of range readings and looked at the difference between our estimated measurement model (the robot's map) and the true measurement model, excluding the landmarks that we used for reconstruction: $\|\hat{C}_{5:6} - C_{5:6}\|_F$. The results are shown in Figure 10.2B, and show that our estimates steadily converge to the true model, corroborating
our  theoretical  results  (in  Section  10.3.3  and  the  Appendix). 

10.4.2  Robotic  Experiments 

We used two freely available range-only SLAM data sets collected from an autonomous lawn
mowing robot [27], shown in Fig. 10.3A.¹ These “Plaza” datasets were collected via radio nodes
from Multispectral Solutions that use time-of-flight of ultra-wide-band signals to provide
inter-node ranging measurements. (Additional details on the experimental setup can be found in
[27].) This system produces a time-stamped range estimate between the mobile robot and
stationary nodes (landmarks) in the environment. The landmark radio nodes are placed atop
traffic cones approximately 138cm above the ground throughout the environment, and one node was
placed on top of the center of the robot’s coordinate frame (also 138cm above the ground). The
robot odometry (dead reckoning) comes from an onboard fiberoptic gyro and wheel encoders. The two
environmental setups, including the locations of the landmarks, the dead reckoning paths, and

¹ http://www.frc.ri.cmu.edu/projects/emergencyresponse/RangeData/index.html

Figure 10.3: The autonomous lawn mower and spectral SLAM. A.) The robotic lawn mower
platform. B.) In the first experiment, the robot traveled 1.9km receiving 3,529 range
measurements. This path minimizes the effect of heading error by balancing the number of left
turns with an equal number of right turns in the robot’s odometry (a commonly used path pattern
in lawn mowing applications). The light blue path indicates the robot’s true path in the
environment, light purple indicates the dead-reckoning path, and dark blue indicates the
spectral SLAM localization result. C.) In the second experiment, the robot traveled 1.3km
receiving 1,816 range measurements. This path highlights the effect of heading error on dead
reckoning performance by turning in the same direction repeatedly. Again, spectral SLAM is able
to accurately recover the robot’s path.


the  ground  truth  paths,  are  shown  in  Figure  10.3B-C.  The  ground  truth  paths  have  2cm  accuracy 
according  to  [27]. 

The  two  Plaza  datasets  that  we  used  to  evaluate  our  algorithm  have  very  different  charac¬ 
teristics.  In  “Plaza  1,”  the  robot  travelled  1.9km,  occupied  9,658  distinct  poses,  and  received 
3,529  range  measurements.  The  path  taken  is  a  typical  lawn  mowing  pattern  that  balances  left 
turns  with  an  equal  number  of  right  turns;  this  type  of  pattern  minimizes  the  effect  of  heading 
error.  In  “Plaza  2,”  the  robot  travelled  1.3km,  occupied  4,091  poses,  and  received  1,816  range 
measurements.  The  path  taken  is  a  loop  which  amplifies  the  effect  of  heading  error.  The  two 
data  sets  were  both  very  sparse,  with  approximately  11  time  steps  (and  up  to  500  steps)  be¬ 
tween  range  readings  for  the  worst  landmark.  We  first  interpolated  the  missing  range  readings 
with  the  method  of  Section  10.3.4.  Then  we  applied  the  rank-7  spectral  SLAM  algorithm  of 
Section  10.3.3;  the  results  are  depicted  in  Figure  10.3B-C.  Qualitatively,  we  see  that  the  robot’s 
localization  path  conforms  to  the  true  path. 

In addition to the qualitative results, we quantitatively compared spectral SLAM to a number
of different competing range-only SLAM algorithms. The localization root mean squared error
(RMSE) in meters for each algorithm is shown in Figure 10.4. The baseline is dead reckoning
(using only the robot’s odometry information). Next are several standard online range-only
SLAM algorithms, summarized in [27]. These algorithms included the Cartesian EKF, FastSLAM [69]
with 5,000 particles, and the ROP-EKF [28]. These previous results only reported the RMSE for
the last 10% of the path, which is typically the best 10% of the path (since it gives the most
time to recover from initialization problems). The full path localization error can be
considerably worse, particularly for the initial portion of the path; see Fig. 5 (right) of [28].

We  also  compared  to  batch  nonlinear  optimization,  via  Gauss-Newton  as  implemented  in 
Matlab’s  fminunc  (see  [53]  for  details).  This  approach  to  solving  the  range-only  SLAM  prob¬ 
lem  can  be  very  data  efficient,  but  is  subject  to  local  optima  and  is  very  computationally  inten¬ 
sive.  We  followed  the  suggestions  of  [53]  and  initialized  with  the  dead-reckoning  estimate  of  the 
robot’s  path.  The  algorithm  took  roughly  2.5  hours  to  converge  on  Plaza  1,  and  45  minutes  to 
converge  on  Plaza  2.  Under  most  evaluation  metrics,  the  nonlinear  batch  algorithm  handily  beats 
the  EKF-based  alternatives. 

Finally, we ran our spectral SLAM algorithm on the same data sets. In contrast to
Gauss-Newton, spectral SLAM is statistically consistent, and much faster: the bulk of the
computation is the fixed-rank SVD, so the time complexity of the algorithm is $O((2N)^2 T)$,
where N is the number of landmarks and T is the number of time steps. Empirically, spectral
SLAM produced results that were comparable to batch optimization in 3-4 orders of magnitude
less time (see Figure 10.4).

Spectral  SLAM  can  also  be  used  as  an  initialization  procedure  for  nonlinear  batch  optimiza¬ 
tion.  This  strategy  combines  the  best  of  both  algorithms  by  allowing  the  locally  optimal  nonlinear 
optimization  procedure  to  start  from  a  theoretically  guaranteed  good  starting  point.  Therefore, 
the  local  optimum  found  by  nonlinear  batch  optimization  should  be  no  worse  than  the  spectral 
SLAM  solution  and  likely  much  better  than  the  batch  optimization  seeded  by  dead-reckoning. 
Empirically,  we  found  this  to  be  the  case  (Figure  10.4).  If  time  and  computational  resources  are 
scarce,  then  we  believe  that  spectral  SLAM  is  clearly  the  best  approach;  if  computation  is  not 
an  issue,  the  best  results  will  almost  certainly  be  found  by  refining  the  spectral  SLAM  solution 
using  a  nonlinear  batch  optimization  procedure. 


10.5  Conclusion 

We  proposed  a  novel  solution  for  the  range-only  SLAM  problem  that  differs  substantially  from 
previous  approaches.  The  essence  of  this  new  approach  is  to  formulate  SLAM  as  a  factorization 
problem,  which  allows  us  to  derive  a  local-minimum  free  spectral  learning  method  that  is  closely 
related  to  SfM  and  spectral  approaches  to  system  identification.  We  provide  theoretical  guaran¬ 
tees  for  our  algorithm,  discuss  how  to  derive  an  online  algorithm,  and  show  how  to  generalize  to 
a  full  robot  system  identification  algorithm.  Finally,  we  demonstrate  that  our  spectral  approach  to 
SLAM beats other state-of-the-art SLAM approaches on real-world range-only SLAM problems.

Method                                        Plaza 1    Plaza 2
Dead Reckoning (full path)                    15.92m     27.28m
Cartesian EKF (last, best 10%)                 0.94m      0.92m
FastSLAM (last, best 10%)                      0.73m      1.14m
ROP EKF (last, best 10%)                       0.65m      0.87m
Batch Opt. (worst 10%)                         1.04m      0.45m
Batch Opt. (last 10%)                          1.01m      0.45m
Batch Opt. (best 10%)                          0.56m      0.20m
Batch Opt. (full path)                         0.79m      0.33m
Spectral SLAM (worst 10%)                      1.01m      0.51m
Spectral SLAM (last 10%)                       0.98m      0.51m
Spectral SLAM (best 10%)                       0.59m      0.22m
Spectral SLAM (full path)                      0.79m      0.35m
Spectral + Batch Optimization (worst 10%)      0.89m      0.40m
Spectral + Batch Optimization (last 10%)       0.81m      0.32m
Spectral + Batch Optimization (best 10%)       0.54m      0.18m
Spectral + Batch Optimization (full path)      0.69m      0.30m

[Bar chart: runtime in seconds of Batch Opt. and Spectral SLAM on Plaza 1 and Plaza 2; the only
surviving bar label is 9264.55.]

Figure 10.4: Comparison of Range-Only SLAM Algorithms. The table shows Localization
RMSE. Spectral SLAM has localization accuracy comparable to batch optimization on its own.
The best results (boldface entries) are obtained by initializing nonlinear batch optimization with
the spectral SLAM solution. The graph compares runtime of Gauss-Newton batch optimization
with spectral SLAM. Empirically, spectral SLAM is 3-4 orders of magnitude faster than batch
optimization on the autonomous lawnmower datasets.

Part III
Conclusions

Chapter 11


Discussion 


Learning models of dynamical systems is a fundamental task for predicting, classifying, and
simulating observations as well as reasoning about actions in time series. In this thesis we
focused on and combined two complementary ideas: predictive representations that model state as
an expectation of the future, and spectral algorithms for learning a low-dimensional state
representation. The key idea is that predictive representations allow dynamical system
parameters to be written in terms of observable quantities; these observable quantities can
then be leveraged to find a valid state space and model parameters.

This framework contrasts sharply with previous approaches to reasoning about and learning
dynamical systems: the predominant class of probabilistic models for representing dynamical
systems is latent variable models, which assume that observations are generated by an
unobservable state. However, learning latent variable models is very difficult. The most popular
approach is the EM algorithm, a local search heuristic that alternately posits values for the
latent variables and then uses these values to estimate the system parameters. Unfortunately, EM
has significant drawbacks: the algorithm is highly sensitive to initial conditions, can take a
long time to converge, and is often numerically brittle in practice.

In  contrast  to  latent  variable  models  and  EM,  the  algorithms  presented  in  this  thesis  have 
excellent  theoretical  properties  that  translate  into  good  practical  performance.  In  particular,  the 
spectral  algorithms  here  are  almost  all  statistically  consistent  and  thus  have  no  local  optima. 
Additionally, the matrix algebra that is used to learn these models is very fast and numerically
stable.  We  highlight  some  of  the  contributed  algorithms  here. 

First,  we  provided  a  novel  spectral  algorithm  for  learning  an  observable  representation  of  a 
Kalman  filter  that  is  closely  linked  to  the  spectral  learning  algorithms  for  the  more  expressive 
models  that  follow.  We  address  the  issue  of  instability  in  Kalman  filters,  proposing  a  constraint- 
generation  algorithm  that  outperforms  previous  methods  in  both  efficiency  and  accuracy. 

Second,  we  developed  a  novel  spectral  learning  algorithm  for  learning  predictive  state  repre¬ 
sentations  (PSRs)  that  outperforms  previous  approaches  to  learning  these  systems.  Our  approach 
is  fast,  accurate,  and  statistically  consistent.  One  of  the  drawbacks  of  PSRs  is  that  if  a  dynamical 
system  has  a  very  large  number  of  different  actions  and  observations,  then  learning  the  parame¬ 
ters  of  such  a  system  can  be  very  difficult.  Therefore,  we  extend  our  representation  so  that  it  can 
be  written  in  terms  of  features  of  actions  and  observations.  We  show  that  this  algorithm  is  also 
consistent and has good predictive accuracy. Additionally, we demonstrate that our algorithms
can be used to learn accurate models of complex environments. In particular, we learn a
representation of an agent moving through a simulated visual environment and then plan in the
learned model. We show that the resulting plans are close to optimal in the original
environment. We also show how to extend temporal difference learning using some of the theory
developed for our spectral system identification algorithm, and we show that the resulting
extension outperforms many competing algorithms. Finally, we show how dynamical systems can be
represented nonparametrically in reproducing kernel Hilbert spaces. We develop a spectral
learning algorithm for this case, show its consistency, and demonstrate that this approach can
lead to very accurate predictive models in practice.

Third,  we  develop  a  spectral  learning  approach  to  solving  the  range-only  simultaneous  lo¬ 
calization  and  mapping  problem.  This  is  a  different  application  of  spectral  learning  compared 
with  the  rest  of  the  thesis,  but  the  results  demonstrate  the  basic  approach  to  learning  a  dynamical 
system  state  has  applications  beyond  stochastic  process  modeling. 

Together,  the  approaches  enumerated  here  show  how  predictive  representations  and  spectral 
learning  algorithms  can  be  unified  into  a  powerful  framework  for  learning  predictive  dynamical 
system  models.  The  research  presented  in  this  thesis  has  the  potential  to  positively  impact  many 
fields  where  sequential  data  modeling  is  important:  robot  sensing  and  planning,  video  modeling, 
activity  recognition  and  user  modeling,  speech  recognition,  bioinformatics,  and  more. 




Part  IV 
Appendix 




Chapter  12 
Appendix 


12.1  Predictive  State  Temporal  Difference  Learning 

12.1.1  Determining  the  Compression  Operator 

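We find a compression operator V that optimally predicts test-features through the CCA
bottleneck defined by U. The least squares estimate can be found by minimizing the following
loss

$$\mathcal{L}(V) = \frac{1}{k}\left\| \phi^{\mathcal{T}}_{1:k} - U V \phi^{\mathcal{H}}_{1:k} \right\|_F^2,
\qquad \hat{V} = \arg\min_V \mathcal{L}(V)$$

where $\|\cdot\|_F$ denotes the Frobenius norm. We can find $\hat{V}$ by taking a derivative of
this loss $\mathcal{L}$ with respect to V, setting it to zero, and solving for V. Writing
$\phi^{\mathcal{T}}$ and $\phi^{\mathcal{H}}$ for $\phi^{\mathcal{T}}_{1:k}$ and
$\phi^{\mathcal{H}}_{1:k}$, and $\Sigma_{\mathcal{T},\mathcal{H}} = \frac{1}{k}\phi^{\mathcal{T}}\phi^{\mathcal{H}\top}$,
$\Sigma_{\mathcal{H},\mathcal{H}} = \frac{1}{k}\phi^{\mathcal{H}}\phi^{\mathcal{H}\top}$:

$$\begin{aligned}
\mathcal{L} &= \tfrac{1}{k}\,\mathrm{tr}\!\left[(\phi^{\mathcal{T}} - UV\phi^{\mathcal{H}})(\phi^{\mathcal{T}} - UV\phi^{\mathcal{H}})^{\top}\right] \\
&= \tfrac{1}{k}\,\mathrm{tr}\!\left[\phi^{\mathcal{T}\top}\phi^{\mathcal{T}} - 2\,\phi^{\mathcal{T}\top}UV\phi^{\mathcal{H}} + \phi^{\mathcal{H}\top}V^{\top}U^{\top}UV\phi^{\mathcal{H}}\right] \\
d\mathcal{L} &= -\tfrac{2}{k}\,\mathrm{tr}\!\left(\phi^{\mathcal{H}\top}\,dV^{\top}U^{\top}\phi^{\mathcal{T}}\right) + \tfrac{2}{k}\,\mathrm{tr}\!\left(\phi^{\mathcal{H}\top}\,dV^{\top}U^{\top}UV\phi^{\mathcal{H}}\right) \\
&= -\tfrac{2}{k}\,\mathrm{tr}\!\left(dV^{\top}U^{\top}\phi^{\mathcal{T}}\phi^{\mathcal{H}\top}\right) + \tfrac{2}{k}\,\mathrm{tr}\!\left(dV^{\top}U^{\top}UV\phi^{\mathcal{H}}\phi^{\mathcal{H}\top}\right) \\
&= -2\,\mathrm{tr}\!\left(dV^{\top}U^{\top}\Sigma_{\mathcal{T},\mathcal{H}}\right) + 2\,\mathrm{tr}\!\left(dV^{\top}U^{\top}UV\Sigma_{\mathcal{H},\mathcal{H}}\right) \\
\implies \frac{\partial\mathcal{L}}{\partial V} &= -2\,U^{\top}\Sigma_{\mathcal{T},\mathcal{H}} + 2\,U^{\top}UV\Sigma_{\mathcal{H},\mathcal{H}} \\
\implies 0 &= -U^{\top}\Sigma_{\mathcal{T},\mathcal{H}} + U^{\top}U\hat{V}\Sigma_{\mathcal{H},\mathcal{H}} \\
\implies \hat{V} &= (U^{\top}U)^{-1}U^{\top}\Sigma_{\mathcal{T},\mathcal{H}}(\Sigma_{\mathcal{H},\mathcal{H}})^{-1} \\
&= U^{\top}\Sigma_{\mathcal{T},\mathcal{H}}(\Sigma_{\mathcal{H},\mathcal{H}})^{-1}
\end{aligned}$$

The last equality uses $U^{\top}U = I$, which holds when the columns of U are orthonormal.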
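
The closed form for the compression operator can be checked numerically. The sketch below is illustrative only: the feature matrices are synthetic, a random matrix with orthonormal columns stands in for the CCA bottleneck U, and all variable names (`phi_H`, `phi_T`, `Sigma_TH`, `Sigma_HH`, `V_closed`) are chosen for this example. It confirms that the closed-form V satisfies the stationarity condition and is not improved by a small perturbation.

```python
import numpy as np

rng = np.random.default_rng(0)
k, d_hist, d_test, dim = 2000, 8, 6, 3   # samples, feature sizes, bottleneck size

# Synthetic history features and (noisy, linearly related) test features.
phi_H = rng.normal(size=(d_hist, k))
phi_T = rng.normal(size=(d_test, d_hist)) @ phi_H + 0.1 * rng.normal(size=(d_test, k))

# Assumed stand-in for the CCA bottleneck: any U with orthonormal columns.
U, _ = np.linalg.qr(rng.normal(size=(d_test, dim)))

Sigma_TH = phi_T @ phi_H.T / k
Sigma_HH = phi_H @ phi_H.T / k

# V = U^T Sigma_TH Sigma_HH^{-1}, computed with a linear solve rather than an inverse.
V_closed = np.linalg.solve(Sigma_HH, (U.T @ Sigma_TH).T).T

def loss(V):
    """Mean squared Frobenius error of predicting phi_T through the bottleneck."""
    return np.sum((phi_T - U @ V @ phi_H) ** 2) / k

gradient = -2 * U.T @ Sigma_TH + 2 * U.T @ U @ V_closed @ Sigma_HH
print("max |gradient| at closed form:", np.abs(gradient).max())
print("loss at closed form:", loss(V_closed))
print("loss after perturbation:", loss(V_closed + 0.01 * rng.normal(size=V_closed.shape)))
```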
12.1.2 Experimental Results

RR-POMDP

The RR-POMDP parameters are:

[m = 4 hidden states, n = 2 observations, k = 3 transition matrix rank].

Transition matrix:

$$\begin{bmatrix}
0.7829 & 0.1036 & 0.0399 & 0.0736 \\
0.1036 & 0.4237 & 0.4262 & 0.0465 \\
0.0399 & 0.4262 & 0.4380 & 0.0959 \\
0.0736 & 0.0465 & 0.0959 & 0.7840
\end{bmatrix}
\qquad
O = \begin{bmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \end{bmatrix}$$

The discount factor is $\gamma = 0.9$.
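
As a quick sanity check on the parameters listed above, the following sketch enters them in NumPy and verifies that the transition matrix is row-stochastic and symmetric and that each hidden state emits exactly one observation. The names `T_mat`, `O`, and `gamma` are labels chosen for this example; the singular values are printed only so the reader can see how close the rounded matrix is to the stated rank.

```python
import numpy as np

# Transition matrix, observation matrix and discount factor as listed in the text.
T_mat = np.array([[0.7829, 0.1036, 0.0399, 0.0736],
                  [0.1036, 0.4237, 0.4262, 0.0465],
                  [0.0399, 0.4262, 0.4380, 0.0959],
                  [0.0736, 0.0465, 0.0959, 0.7840]])
O = np.array([[1, 0, 1, 0],
              [0, 1, 0, 1]])
gamma = 0.9

row_sums = T_mat.sum(axis=1)                        # each row should sum to 1
singular_values = np.linalg.svd(T_mat, compute_uv=False)
print("row sums:", row_sums)
print("singular values:", singular_values)
```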


Pricing  a  financial  derivative 

Basis functions. The first 16 are the basis functions suggested by Van Roy; for a full
description and justification see [23, 115]. The first functions consist of a constant, the
reward, the minimal and maximal returns, and how long ago they occurred:

$$\begin{aligned}
\phi_1(x) &= 1 \\
\phi_2(x) &= G(x) \\
\phi_3(x) &= \min_{i=1,\ldots,100} x(i) - 1 \\
\phi_4(x) &= \max_{i=1,\ldots,100} x(i) - 1 \\
\phi_5(x) &= \arg\min_{i=1,\ldots,100} x(i) - 1 \\
\phi_6(x) &= \arg\max_{i=1,\ldots,100} x(i) - 1
\end{aligned}$$

The next set of basis functions ($\phi_7$ through $\phi_{10}$) summarizes the characteristics of
the basic shape of the 100-day sample path. They are the inner products of the path with the
first four Legendre polynomials. Let $j = i/50 - 1$.

Nonlinear combinations of basis functions:

$$\begin{aligned}
\phi_{11}(x) &= \phi_2(x)\phi_3(x) \\
\phi_{12}(x) &= \phi_2(x)\phi_4(x) \\
\phi_{13}(x) &= \phi_2(x)\phi_7(x) \\
\phi_{14}(x) &= \phi_2(x)\phi_8(x) \\
\phi_{15}(x) &= \phi_2(x)\phi_9(x) \\
\phi_{16}(x) &= \phi_2(x)\phi_{10}(x)
\end{aligned}$$

In order to improve our results, we added a large number of additional basis functions to these
hand-picked 16. PSTD will compress these features for us, so we can use as many additional
basis functions as we would like. First we defined 4 additional basis functions consisting of the
inner products of the 100-day sample path with the 5th and 6th Legendre polynomials, and we
added the corresponding nonlinear combinations of basis functions:

$$\begin{aligned}
\phi_{17}(x) &= \frac{1}{100}\sum_{i=1}^{100} x(i)\,\sqrt{\frac{9}{2}}\;\frac{35j^4 - 30j^2 + 3}{8} \\
\phi_{18}(x) &= \frac{1}{100}\sum_{i=1}^{100} x(i)\,\sqrt{\frac{11}{2}}\;\frac{63j^5 - 70j^3 + 15j}{8} \\
\phi_{19}(x) &= \phi_2(x)\phi_{17}(x) \\
\phi_{20}(x) &= \phi_2(x)\phi_{18}(x)
\end{aligned}$$

Finally we added the entire sample path and the squared sample path:

$$\begin{aligned}
\phi_{21:120} &= x_{1:100} \\
\phi_{121:220} &= x^2_{1:100}
\end{aligned}$$
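
The Legendre-based features can be computed with NumPy's Legendre utilities. The sketch below is a hedged illustration: the function name `legendre_features` is invented for this example, and it assumes that every degree uses the same form as the 5th and 6th polynomial features, namely the average over the path of x(i) times the normalized Legendre polynomial $\sqrt{(2n+1)/2}\,P_n(j)$ with $j = i/50 - 1$. Whether the lower-degree features in the original use exactly this form is an assumption.

```python
import numpy as np
from numpy.polynomial import legendre

def legendre_features(x, max_degree=5):
    """Average of x(i) * sqrt((2n+1)/2) * P_n(j) over the path, for n = 0..max_degree."""
    x = np.asarray(x, dtype=float)
    n_days = x.size                        # 100 in the text
    i = np.arange(1, n_days + 1)
    j = i / (n_days / 2.0) - 1.0           # j = i/50 - 1 when n_days = 100
    feats = []
    for deg in range(max_degree + 1):
        coeffs = np.zeros(deg + 1)
        coeffs[deg] = 1.0                  # selects P_deg in the Legendre basis
        p = legendre.legval(j, coeffs)
        feats.append(np.sqrt((2 * deg + 1) / 2.0) * np.mean(x * p))
    return np.array(feats)

# Example: a hypothetical 100-day path of normalized prices.
rng = np.random.default_rng(0)
path = 1.0 + 0.01 * np.cumsum(rng.normal(size=100))
print(legendre_features(path))
```

Under these assumptions, entries 0 through 3 of the result play the role of the four shape features and entries 4 and 5 correspond to the two additional features.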


12.2  A  Spectral  Learning  Approach  to  Range  Only  SLAM 

12.2.1  Metric  Upgrade  for  Learned  Map 

In Chapter 10, we assumed that global position estimates of at least four landmarks were known.
When these landmarks are known, we can recover all of the estimated landmark positions and robot
locations.

In many cases, however, no global positions are known; the best we can hope to do is recover
landmark and robot positions up to an orthogonal transform (translation, rotation, and
reflection). It turns out that Eqs. (10.5-10.6) provide enough geometric constraints to perform
this metric upgrade, as long as we have at least 9 landmarks and at least 9 time steps, and as
long as C and X are nonsingular in the following sense: define the matrix $C_2$, with the same
number of rows as C but 10 columns, whose $i$th row has elements $c_{i,j}c_{i,k}$ for
$1 \le j \le k \le 4$ (in any fixed order). Note that the rank of $C_2$ can be at most 9: from
Eq. 10.5, we know that $c_{i,2}^2 + c_{i,3}^2 - 2c_{i,4} = 0$, and each of the three terms in
this function is a multiple of a column of $C_2$. We will say that C is nonsingular if $C_2$
has rank exactly 9, i.e., is rank deficient by exactly 1 dimension. The conditions for X are
analogous, swapping rows for columns.¹

To derive the metric upgrade, suppose that we start from an $N \times 4$ matrix U of learned
landmark coordinates and a $4 \times T$ matrix V of learned robot coordinates from the algorithm
of Sec. 10.3.1. And, suppose that we have at least 9 nonsingular landmarks and robot positions.
We would like to transform the learned coordinates into two new matrices C and X such that

$$\begin{aligned}
c_1 &\approx 1 && (12.1) \\
c_4 &\approx \tfrac{1}{2}c_2^2 + \tfrac{1}{2}c_3^2 && (12.2) \\
x_4 &\approx 1 && (12.3) \\
x_1 &\approx \tfrac{1}{2}x_2^2 + \tfrac{1}{2}x_3^2 && (12.4)
\end{aligned}$$

where c is a row of C and x is a column of X.

At  a  high  level,  we  first  fit  a  quadratic  surface  to  the  rows  of  U,  then  transform  this  surface  so 
that  it  satisfies  Eq.  12.1-12.2,  and  scale  the  surface  so  that  it  satisfies  Eq.  12.3.  Our  surface  will 
then  automatically  also  satisfy  Eq.  12.4,  since  X  must  be  metrically  correct  if  C  is. 

In more detail, we first (step i) linearly transform each row of U into approximately the form
$(1, r_{i,1}, r_{i,2}, r_{i,3})$: we use linear regression to find a coefficient vector
$a \in \mathbb{R}^4$ such that $Ua \approx 1$, then set $R = UQ$ where $Q \in \mathbb{R}^{4 \times 3}$
is an orthonormal basis for the nullspace of $a^\top$. After this step, our factorization is
$(UT_1)(T_1^{-1}V)$, where $T_1 = (a \;\; Q)$.
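
Step i can be sketched with standard least squares and an SVD. This is an illustrative sketch, not the thesis implementation; the function name `step_one` is invented here, and the nullspace basis is obtained from the full SVD of the column vector a, which is one of several equivalent ways to build Q.

```python
import numpy as np

def step_one(U):
    """Find a with U a ~ 1, an orthonormal basis Q of null(a^T), and T1 = (a Q)."""
    n_rows = U.shape[0]
    a, *_ = np.linalg.lstsq(U, np.ones(n_rows), rcond=None)   # linear regression
    # Full SVD of the 4x1 matrix a: columns 2..4 of the left factor span null(a^T).
    left, _, _ = np.linalg.svd(a.reshape(-1, 1), full_matrices=True)
    Q = left[:, 1:]
    T1 = np.column_stack([a, Q])
    return a, Q, T1

# Example on synthetic landmark coordinates mixed by an unknown invertible map.
rng = np.random.default_rng(1)
xy = rng.normal(size=(12, 2))
C_true = np.column_stack([np.ones(12), xy, 0.5 * (xy**2).sum(axis=1)])
U_demo = C_true @ rng.normal(size=(4, 4))
a, Q, T1 = step_one(U_demo)
R = U_demo @ Q                      # rows are (r_1, r_2, r_3)
print((U_demo @ T1)[:3])            # first column should be approximately 1
```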

Next (step ii) we fit an implicit quadratic surface to the rows of R by finding 10 coefficients
$b_{jk}$ (for $0 \le j \le k \le 3$) such that

$$\begin{aligned}
0 \approx\; & b_{00} + b_{01}r_{i,1} + b_{02}r_{i,2} + b_{03}r_{i,3} \;+ \\
& b_{11}r_{i,1}^2 + b_{12}r_{i,1}r_{i,2} + b_{13}r_{i,1}r_{i,3} + b_{22}r_{i,2}^2 + b_{23}r_{i,2}r_{i,3} + b_{33}r_{i,3}^2
\end{aligned}$$

To do so, we form a matrix S that has the same number of rows as U but 10 columns. The
elements of row i of S are $r_{i,j}r_{i,k}$ for $0 \le j \le k \le 3$ (in any fixed order).
Here, for convenience, we define $r_{i,0} = 1$ for all i. Then we find a vector
$b \in \mathbb{R}^{10}$ that is approximately in the nullspace of S by taking a singular value
decomposition of S and selecting the right singular vector corresponding to the smallest
singular value. Using this vector, we can define our quadratic as
$0 \approx \tfrac{1}{2}r^\top H r + \ell^\top r + b_{00}$, where r is a row of R, and the
Hessian matrix H and linear part $\ell$ are given by (with $b_{kj} = b_{jk}$):

$$H = \begin{pmatrix} 2b_{11} & b_{12} & b_{13} \\ b_{21} & 2b_{22} & b_{23} \\ b_{31} & b_{32} & 2b_{33} \end{pmatrix}
\qquad
\ell = \begin{pmatrix} b_{01} \\ b_{02} \\ b_{03} \end{pmatrix}$$

Over the next few steps we will transform the coordinates in R to bring our quadratic into the
form of Eq. 12.2: that is, one coordinate will be a quadratic function of the other two, there
will be no linear or constant terms, and the quadratic part will be spherical with coefficient
$\tfrac{1}{2}$.
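
The quadric fit of step ii can be sketched as follows. The function name `fit_quadric` and the column ordering of S are choices made for this example. The Hessian is assembled so that $\tfrac{1}{2}r^\top H r$ reproduces the fitted quadratic part, i.e., with diagonal entries $2b_{jj}$ and off-diagonal entries $b_{jk}$.

```python
import numpy as np
from itertools import combinations_with_replacement

def fit_quadric(R):
    """Fit 0 ~ sum_{j<=k} b_jk r_j r_k (with r_0 = 1) to the rows of R (N x 3)."""
    n_rows = R.shape[0]
    R0 = np.column_stack([np.ones(n_rows), R])                  # prepend r_{i,0} = 1
    pairs = list(combinations_with_replacement(range(4), 2))    # (0,0), (0,1), ..., (3,3)
    S = np.column_stack([R0[:, j] * R0[:, k] for j, k in pairs])
    # Right singular vector for the smallest singular value (rows of Vt are sorted).
    _, _, Vt = np.linalg.svd(S, full_matrices=True)
    b = dict(zip(pairs, Vt[-1]))
    H = np.empty((3, 3))
    for j in range(1, 4):
        for k in range(j, 4):
            H[j - 1, k - 1] = H[k - 1, j - 1] = (2.0 if j == k else 1.0) * b[(j, k)]
    ell = np.array([b[(0, 1)], b[(0, 2)], b[(0, 3)]])
    return H, ell, b[(0, 0)]

# Example: points lying exactly on the paraboloid r_3 = (r_1^2 + r_2^2) / 2.
rng = np.random.default_rng(2)
r12 = rng.normal(size=(40, 2))
R_demo = np.column_stack([r12, 0.5 * (r12**2).sum(axis=1)])
H, ell, b00 = fit_quadric(R_demo)
residual = 0.5 * np.einsum('ij,jk,ik->i', R_demo, H, R_demo) + R_demo @ ell + b00
print("largest residual:", np.abs(residual).max())
```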

¹For intuition, a set of landmarks or robot positions that all lie on the same quadratic surface
(line, circle, parabola, etc.) will be singular. Some higher-order constraints will also lead to
singularity; e.g., a set of points will be singular if they all satisfy
$\tfrac{1}{2}(x_i^2 + y_i^2)x_i + y_i = 0$, since each of the two terms in this function is a
column of $C_2$.

We start (step iii) by transforming coordinates so that our quadratic has no cross-terms, i.e.,
so that its Hessian matrix is diagonal. Using a $3 \times 3$ singular value decomposition, we
can factor $H = MH'M^\top$ so that M is orthonormal and $H'$ is diagonal. If we set $R' = RM$
and $\ell' = M^\top\ell$, and write $r'$ for a row of $R'$, we can equivalently write our
quadratic as $0 = \tfrac{1}{2}(r')^\top H' r' + (\ell')^\top r' + b_{00}$, which has a diagonal
Hessian as desired. After this step, our factorization is $(UT_1T_2)(T_2^{-1}T_1^{-1}V)$, where

$$T_2 = \begin{pmatrix} 1 & 0 \\ 0 & M \end{pmatrix}$$

Our next step (step iv) is to turn our implicit quadratic surface into an explicit quadratic
function. For this purpose we pick one of the coordinates of $R'$ and write it as a function of
the other two. In order to do so, we must have zero as the corresponding diagonal element of the
Hessian $H'$; otherwise we cannot guarantee that we can solve for a unique value of the chosen
coordinate. So, we will take the index j such that $H'_{jj}$ is minimal, and set $H'_{jj} = 0$.
Suppose that we pick the last coordinate, $j = 3$. (We can always reorder columns to make this
true; SVD software will typically do so automatically.) Then our quadratic becomes

$$\begin{aligned}
0 &= \tfrac{1}{2}H'_{11}(r'_1)^2 + \tfrac{1}{2}H'_{22}(r'_2)^2 + \ell'_1 r'_1 + \ell'_2 r'_2 + \ell'_3 r'_3 + b_{00} \\
r'_3 &= -\frac{1}{\ell'_3}\left[\tfrac{1}{2}H'_{11}(r'_1)^2 + \tfrac{1}{2}H'_{22}(r'_2)^2 + \ell'_1 r'_1 + \ell'_2 r'_2 + b_{00}\right]
\end{aligned}$$

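Now (step v) we can shift and rescale our coordinates one more time to get our quadratic in
the desired form: translate so that the linear and constant coefficients are 0, and rescale so
that the quadratic coefficients are $\tfrac{1}{2}$. For the translation, we define new
coordinates $r'' = r' + c$ for $c \in \mathbb{R}^3$, so that our quadratic becomes

$$r''_3 - c_3 = -\frac{1}{\ell'_3}\left[\tfrac{1}{2}H'_{11}(r''_1 - c_1)^2 + \tfrac{1}{2}H'_{22}(r''_2 - c_2)^2 + \ell'_1(r''_1 - c_1) + \ell'_2(r''_2 - c_2) + b_{00}\right]$$

By expanding and matching coefficients, we know c must satisfy

$$\begin{aligned}
0 &= \frac{H'_{11}}{\ell'_3}c_1 - \frac{\ell'_1}{\ell'_3} && \text{(coefficient of } r''_1\text{)} \\
0 &= \frac{H'_{22}}{\ell'_3}c_2 - \frac{\ell'_2}{\ell'_3} && \text{(coefficient of } r''_2\text{)} \\
0 &= c_3 - \frac{H'_{11}}{2\ell'_3}c_1^2 - \frac{H'_{22}}{2\ell'_3}c_2^2 + \frac{\ell'_1}{\ell'_3}c_1 + \frac{\ell'_2}{\ell'_3}c_2 - \frac{b_{00}}{\ell'_3} && \text{(constant)}
\end{aligned}$$

The first two equations are linear in $c_1$ and $c_2$ (and don’t contain $c_3$). So, we can
solve directly for $c_1$ and $c_2$; then we can plug their values into the last equation to find
$c_3$. For the scaling, the coefficient of $(r''_1)^2$ is now $-\frac{H'_{11}}{2\ell'_3}$, and
that of $(r''_2)^2$ is now $-\frac{H'_{22}}{2\ell'_3}$. So, we can just scale these two
coordinates separately to bring their coefficients to $\tfrac{1}{2}$.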
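After this step, our factorization is $U'V'$, where $U' = UT_1T_2T_3$ and
$V' = T_3^{-1}T_2^{-1}T_1^{-1}V$, and

$$T_3 = \begin{pmatrix}
1 & s_1c_1 & s_2c_2 & c_3 \\
0 & s_1 & 0 & 0 \\
0 & 0 & s_2 & 0 \\
0 & 0 & 0 & 1
\end{pmatrix},
\qquad s_1 = \sqrt{-\frac{H'_{11}}{\ell'_3}}, \quad s_2 = \sqrt{-\frac{H'_{22}}{\ell'_3}}$$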

The left factor $U'$ will now satisfy Eq. 12.1-12.2. We still have one last useful degree of
freedom: if we set $C = U'T_4$, where

$$T_4 = \begin{pmatrix}
1 & 0 & 0 & 0 \\
0 & \mu & 0 & 0 \\
0 & 0 & \mu & 0 \\
0 & 0 & 0 & \mu^2
\end{pmatrix}$$

for any $\mu \in \mathbb{R}$, then C will still satisfy Eq. 12.1-12.2. So (step vi), we will
pick $\mu$ to satisfy Eq. 12.3: in particular, we set $\mu = \sqrt{\mathrm{mean}(V'_{4,:})}$, so
that when we set $X = T_4^{-1}V'$, the last row of X will have mean 1.

If  we  have  7  learned  coordinates  in  U  as  in  Sec.  10.3.2,  we  need  to  find  a  subspace  of  4 
coordinates  in  order  to  perform  metric  upgrade.  To  do  so,  we  take  advantage  of  the  special  form 
of  the  correct  answer,  given  in  Eq.  10.9:  in  the  upper  block  of  C  in  Eq.  10.9,  three  coordinates 
are  identically  zero.  Since  U  is  a  linear  transformation  of  C,  there  will  be  three  linear  functions 
of  the  top  block  of  U  that  are  identically  zero  (or  approximately  zero  in  the  presence  of  noise). 
As  long  as  the  landmark  positions  are  nonsingular,  we  can  use  SVD  on  the  top  block  of  U  to 
find  and  remove  these  linear  functions  (by  setting  the  smallest  three  singular  values  to  zero),  then 
proceed  as  above  with  the  four  remaining  coordinates. 


12.2.2  Sample  Complexity  for  the  Measurement  Model  (Robot  Map) 

Here we provide the details on how our estimation error scales with the number T of training
examples, that is, the scaling of the difference between the estimated measurement model
$\hat{U}$, which contains the location of the landmarks, and its population counterpart.

Our bound has two parts. First we use a standard concentration bound (the Azuma-Hoeffding
inequality) to show that each element of our estimated covariance
$\hat{M} = \frac{1}{T}YY^\top$ approaches its population value. We start by rewriting the
empirical covariance matrix as a vector summed over multiple samples:

$$\mathrm{vec}(\hat{M}) = \frac{1}{T}\sum_{t=1}^{T}\Upsilon_{:,t}$$

where $\Upsilon = Y \odot Y$ is the matrix of column-wise Kronecker products of the
observations Y. We assume that each element of $\Upsilon$ minus its expectation
$\mathbb{E}\Upsilon_i$ is bounded by a constant c; we can derive c from bounds on anticipated
errors in distance measurements and odometry measurements.

$$|\Upsilon_{i,t} - \mathbb{E}\Upsilon_i| \le c, \quad \forall i, t$$

Then the Azuma-Hoeffding inequality bounds the probability that the empirical sum differs too
much from its population value: for any $\alpha > 0$ and any i,

$$P\left(\left|\sum_{t=1}^{T}(\Upsilon_{i,t} - \mathbb{E}\Upsilon_i)\right| \ge \alpha\right) \le 2e^{-\alpha^2/(2Tc^2)}$$

If we pick $\alpha = \sqrt{2Tc^2\log(T)}$, then we can rewrite the probability in terms of T:

$$P\left(\left|\frac{1}{T}\sum_{t=1}^{T}(\Upsilon_{i,t} - \mathbb{E}\Upsilon_i)\right| \ge c\sqrt{\frac{2\log(T)}{T}}\right) \le 2e^{-\log(T)} = \frac{2}{T}$$

which means that the probability decreases as $O(\frac{1}{T})$ and the threshold decreases as
$\tilde{O}(\frac{1}{\sqrt{T}})$. We can then use a union bound over all $(2N)^2$ covariance
elements (since $Y \in \mathbb{R}^{2N \times T}$):

$$P\left(\exists i:\; \left|\frac{1}{T}\sum_{t=1}^{T}\Upsilon_{i,t} - \mathbb{E}\Upsilon_i\right| \ge c\sqrt{\frac{2\log(T)}{T}}\right) \le \frac{8N^2}{T}$$

That is, with high probability, the entire empirical covariance matrix $\hat{M}$ will be close
(in max-norm) to its expectation.
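
To make the rates concrete, the short sketch below evaluates the threshold and the failure probabilities from the bound for given T, N, and c. The function name `covariance_bound` and the example values are illustrative assumptions; the formulas are the ones stated in the text.

```python
import math

def covariance_bound(T, N, c):
    """Threshold and failure probabilities from the Azuma-Hoeffding argument."""
    alpha = math.sqrt(2.0 * T * c**2 * math.log(T))
    threshold = alpha / T                                   # c * sqrt(2 log(T) / T)
    per_element = 2.0 * math.exp(-alpha**2 / (2.0 * T * c**2))   # equals 2 / T
    union = (2 * N) ** 2 * per_element                      # equals 8 N^2 / T
    return threshold, per_element, union

# Example values (hypothetical): 10,000 time steps, 4 landmarks, c = 1.
threshold, per_element, union = covariance_bound(T=10_000, N=4, c=1.0)
print(threshold, per_element, union)
```

The union-bound failure probability only becomes informative once T is large relative to $8N^2$, which is worth keeping in mind when interpreting the bound for short trajectories.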

Next we use the continuity of the SVD to show that the learned subspace approaches its true
value. Let $\hat{M} = M + E$, where E is the perturbation (so the largest element of E is
bounded). Let $\hat{U}$ be the output of SVD, and let U be the population value (the top
singular vectors of the true M). Let $\Psi$ be the matrix of canonical angles between
$\mathrm{range}(U)$ and $\mathrm{range}(\hat{U})$. Since we know the exact rank of the true M
(either 4 or 7), the last (4th or 7th) singular value of M will be positive; call it
$\gamma > 0$. So, by Theorem 4.4 of Stewart and Sun [104],

$$\|\sin\Psi\|_2 \le \frac{\|E\|_2}{\gamma}$$

This result uses a 2-norm bound on E, but the bound we showed above is in terms of the largest
element of E. However, the 2-norm can be bounded in terms of the largest element; since E is
$2N \times 2N$,

$$\|E\|_2 \le 2N\max_{ij}|E_{ij}|$$

Finally, the result is that we can bound the canonical angle:

$$\|\sin\Psi\|_2 \le \frac{2Nc\sqrt{\frac{2\log(T)}{T}}}{\gamma}$$

In other words, the canonical angle shrinks at a rate of $\tilde{O}(\frac{1}{\sqrt{T}})$, with
probability at least $1 - \frac{8N^2}{T}$.
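
The two ingredients of this argument, the bound of the 2-norm by the largest entry and the sensitivity of the top singular subspace to a perturbation, can be explored numerically. The sketch below is illustrative only: the matrix sizes, the random low-rank matrix standing in for the true M, and the perturbation scale are all assumptions. It prints the quantities involved so they can be compared directly.

```python
import numpy as np

rng = np.random.default_rng(0)
n, rank = 12, 4                       # n plays the role of 2N
A = rng.normal(size=(n, rank))
M = A @ A.T                           # stand-in for the true covariance (rank 4)
E = 1e-3 * rng.normal(size=(n, n))
E = 0.5 * (E + E.T)                   # symmetric perturbation

U = np.linalg.svd(M)[0][:, :rank]
U_hat = np.linalg.svd(M + E)[0][:, :rank]

# Sines of the canonical angles: singular values of (I - U U^T) U_hat.
sin_angles = np.linalg.svd(U_hat - U @ (U.T @ U_hat), compute_uv=False)
gamma = np.linalg.svd(M, compute_uv=False)[rank - 1]
spectral_norm = np.linalg.norm(E, 2)
max_entry = np.abs(E).max()

print("||E||_2 =", spectral_norm, "  n * max|E_ij| =", n * max_entry)
print("largest sin(angle) =", sin_angles.max(), "  ||E||_2 / gamma =", spectral_norm / gamma)
```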




12.2.3  The  Robot  as  a  Nonlinear  Dynamical  System 

Once  we  have  learned  an  interpretable  state  space  via  the  algorithm  of  Section  10.3.3,  we  can 
simply  write  down  the  nominal  robot  dynamics  in  this  space.  The  accuracy  of  the  resulting  model 
will  depend  on  how  well  our  sensors  and  actuators  follow  the  nominal  dynamics,  as  well  as  how 
well  we  have  learned  the  transformation  S  to  the  interpretable  version  of  the  state  space. 

In more detail, we model the robot as a controlled nonlinear dynamical system. The evolution
is governed by the following state space equations, which generalize (10.1):

$$\begin{aligned}
s_{t+1} &= f(s_t, a_t) + \epsilon_t && (12.5) \\
o_t &= h(s_t) + \nu_t && (12.6)
\end{aligned}$$

Here $s_t$ denotes the hidden state, $a_t$ denotes the control signal, $o_t \in \mathbb{R}^m$
denotes the observation, $\epsilon_t$ denotes the state noise, and $\nu_t \in \mathbb{R}^m$
denotes the observation noise. For our range-only system, following the decomposition of
Section 10.3, we have:

$$s_t = \begin{bmatrix}
1 \\ -x_t \\ -y_t \\ (x_t^2 + y_t^2)/2 \\ -\cos(\theta_t) \\ -\sin(\theta_t) \\ \frac{x_{t+1}^2 - x_t^2 + y_{t+1}^2 - y_t^2}{2v_t}
\end{bmatrix},
\quad
o_t = \begin{bmatrix}
d_{1t}^2/2 \\ \vdots \\ d_{Nt}^2/2 \\ \frac{d_{1,t+1}^2 - d_{1t}^2}{2v_t} \\ \vdots \\ \frac{d_{N,t+1}^2 - d_{Nt}^2}{2v_t}
\end{bmatrix},
\quad
a_t = \begin{bmatrix} v_t \\ \cos(\omega_t) \\ \sin(\omega_t) \end{bmatrix}
\qquad (12.7)$$

Here $v_t$ and $\omega_t$ are the translation and rotation calculated from the robot’s
odometry. A nice property of this model is that expected observations are a linear function of
state:

$$h(s_t) = Cs_t \qquad (12.8)$$

The dynamics, however, are nonlinear (see Eq. 12.9). They can easily be derived from the
basic kinematic motion model for a wheeled robot [111].


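$$
f(s_t, a_t) = \begin{bmatrix}
1 \\
-x_t - v_t \cos(\theta_t) \\
-y_t - v_t \sin(\theta_t) \\
\frac{x_t^2 + y_t^2}{2} + v_t x_t \cos(\theta_t) + v_t y_t \sin(\theta_t) + \frac{v_t^2}{2} \\
-\cos(\theta_t)\cos(\omega_t) + \sin(\theta_t)\sin(\omega_t) \\
-\sin(\theta_t)\cos(\omega_t) - \cos(\theta_t)\sin(\omega_t) \\
\left[\, x_t \cos(\theta_t)\cos(\omega_t) - x_t \sin(\theta_t)\sin(\omega_t) + v_t \cos^2(\theta_t)\cos(\omega_t) \right. \\
\quad \left. +\, y_t \sin(\theta_t)\cos(\omega_t) + y_t \sin(\omega_t)\cos(\theta_t) + v_t \sin^2(\theta_t)\cos(\omega_t) \,\right]
\end{bmatrix}
$$
(12.9)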
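
The nominal dynamics can be sanity-checked against direct pose propagation. The following is a minimal sketch, not the thesis implementation: it assumes the standard unicycle update $x_{t+1} = x_t + v_t\cos\theta_t$, $y_{t+1} = y_t + v_t\sin\theta_t$, $\theta_{t+1} = \theta_t + \omega_t$, and it takes the last state entry in its small-step form $x_t\cos\theta_t + y_t\sin\theta_t$ rather than as a finite difference. The function names `state_from_pose` and `nominal_dynamics` are hypothetical.

```python
import math

def state_from_pose(x, y, theta):
    """Interpretable state; the last entry uses the small-step form (assumption)."""
    return [1.0, -x, -y, 0.5 * (x * x + y * y),
            -math.cos(theta), -math.sin(theta),
            x * math.cos(theta) + y * math.sin(theta)]

def nominal_dynamics(s, a):
    """Noise-free f(s_t, a_t), written only in terms of entries of s and a."""
    v, cw, sw = a                       # a_t = [v_t, cos(w_t), sin(w_t)]
    return [
        1.0,
        s[1] + v * s[4],                                # -x - v cos(theta)
        s[2] + v * s[5],                                # -y - v sin(theta)
        s[3] + v * s[6] + 0.5 * v * v,                  # (x'^2 + y'^2) / 2
        s[4] * cw - s[5] * sw,                          # -cos(theta + w)
        s[5] * cw + s[4] * sw,                          # -sin(theta + w)
        s[6] * cw + (s[2] * s[4] - s[1] * s[5]) * sw
        + v * (s[4] ** 2 + s[5] ** 2) * cw,             # x' cos(theta') + y' sin(theta')
    ]

# Compare against propagating the pose directly with the unicycle model.
x, y, theta, v, w = 0.3, -1.2, 0.7, 0.5, 0.2
predicted = nominal_dynamics(state_from_pose(x, y, theta),
                             [v, math.cos(w), math.sin(w)])
expected = state_from_pose(x + v * math.cos(theta),
                           y + v * math.sin(theta),
                           theta + w)
max_error = max(abs(p - e) for p, e in zip(predicted, expected))
```

Every entry of `nominal_dynamics` is a sum of products of at most two state entries and at most two action entries, which is the property the feature expansion in the next subsection relies on.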

Robot System Identification

To apply the model of Section 12.2.3, it is essential that we maintain states in the physical coordinate
frame, and not just the linearly transformed coordinate frame, i.e., $C$ and not $U = CS^{-1}$.
So, to use this model, we must first learn $S$, either by regression or by metric upgrade.

However, it is possible instead to use system identification to learn to filter directly in the raw
state space $U$. We conjecture that it may be more robust to do so, since we will not be sensitive to
errors in the metric upgrade process (errors in learning $S$), and since we can learn to compensate
for some deviations from the nominal model of Section 12.2.3.

To derive our system identification algorithm, we can explicitly rewrite $f(s_t, a_t)$ as a nonlinear
feature-expansion map followed by a linear projection. Our algorithm will then just be to use
linear regression to learn the linear part of $f$.

First, let's look at the dynamics for the special case of $S = I$. Each additive term in Eq. 12.9
is the product of at most two terms in $s_t$ and at most two terms in $a_t$. Therefore, we define
$\phi(s_t, a_t) := s_t \otimes s_t \otimes \bar{a}_t \otimes \bar{a}_t$, where $\bar{a}_t = [1, a_t]^\top$ and $\otimes$ is the Kronecker product. (Many of the
dimensions of $\phi(s_t, a_t)$ are duplicates; for efficiency we would delete these duplicates, but for
simplicity of notation we keep them.) Each additive term in Eq. 12.9 is a multiple of an element
of $\phi(s_t, a_t)$, so we can write the dynamics as:

$s_{t+1} = N \phi(s_t, a_t) + \epsilon_t$  (12.10)

where  N  is  a  linear  function  that  picks  out  the  correct  entries  to  form  Eq.  12.9. 

Now, given an invertible matrix $S$, we can rewrite $f(s_t, a_t)$ as an equivalent function $\bar{f}$ in the
transformed state space:

$S s_{t+1} = \bar{f}(S s_t, a_t) + S \epsilon_t$  (12.11)

To do so, we use the identity $(Ax) \otimes (By) = (A \otimes B)(x \otimes y)$. Repeated application yields

$\phi(S s_t, a_t) = S s_t \otimes S s_t \otimes \bar{a}_t \otimes \bar{a}_t$

$= (S \otimes S \otimes I \otimes I)(s_t \otimes s_t \otimes \bar{a}_t \otimes \bar{a}_t)$

$= \bar{S} \phi(s_t, a_t)$  (12.12)

where $\bar{S} = S \otimes S \otimes I \otimes I$. Note that $\bar{S}$ is invertible (since $\mathrm{rank}(A \otimes B) = \mathrm{rank}(A)\,\mathrm{rank}(B)$);
so, we can write

$\bar{f}(S s_t, a_t) = S N \bar{S}^{-1} \bar{S} \phi(s_t, a_t) = S f(s_t, a_t)$  (12.13)

Using this representation, we can learn the linear part of $\bar{f}$, $S N \bar{S}^{-1}$, directly from our state
estimates: we just do a linear regression from $\phi(S s_t, a_t)$ to $S s_{t+1}$.
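
The regression step can be illustrated on a toy problem. The sketch below is an assumption-laden illustration rather than the robot system itself: it uses a 3-dimensional state, a scalar action, a randomly drawn linear part `N`, a hand-picked invertible `S`, and i.i.d. random states instead of a trajectory. It checks the Kronecker identity of Eq. 12.12 and that the regression solution predicts $S s_{t+1}$ on held-out data.

```python
import numpy as np

def phi(s, a):
    """phi(s, a) = s (x) s (x) a_bar (x) a_bar with a_bar = [1, a]; duplicates kept."""
    a_bar = np.concatenate(([1.0], a))
    return np.kron(np.kron(s, s), np.kron(a_bar, a_bar))

rng = np.random.default_rng(0)
k, l, T = 3, 1, 400                                   # toy sizes (assumed)
d = k * k * (l + 1) ** 2                              # length of phi(s, a)
N = rng.normal(size=(k, d))                           # hypothetical linear part of f
S = np.array([[2.0, 1.0, 0.0],
              [0.0, 1.0, 1.0],
              [1.0, 0.0, 3.0]])                       # invertible (determinant 7)
S_bar = np.kron(np.kron(S, S), np.eye((l + 1) ** 2))  # S (x) S (x) I (x) I

# Noise-free training data in the original coordinates.
states = rng.normal(size=(T, k))
actions = rng.normal(size=(T, l))
next_states = np.array([N @ phi(s, a) for s, a in zip(states, actions)])

# Regression in the transformed coordinates: from phi(S s_t, a_t) to S s_{t+1}.
features = np.array([phi(S @ s, a) for s, a in zip(states, actions)]).T  # d x T
targets = (next_states @ S.T).T                                          # k x T
W = targets @ np.linalg.pinv(features, rcond=1e-10)

# Held-out check on a fresh state-action pair.
s_new, a_new = rng.normal(size=k), rng.normal(size=l)
identity_gap = np.abs(phi(S @ s_new, a_new) - S_bar @ phi(s_new, a_new)).max()
prediction_gap = np.abs(W @ phi(S @ s_new, a_new) - S @ N @ phi(s_new, a_new)).max()
```

Because the duplicated feature dimensions make the feature matrix rank deficient, the pseudo-inverse is truncated with an explicit `rcond`. The recovered `W` is then not unique as a matrix, but its predictions agree with those of $S N \bar{S}^{-1}$ on every feature vector.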

For  convenience,  we  summarize  the  entire  learning  algorithm  (state  space  discovery  followed 
by  system  identification)  as  Algorithm  2. 

Filtering  with  the  Extended  Kalman  Filter 

Whether we learn the dynamics through system identification or simply write them down in the
interpretable version of our state space, we will end up with a transition model of the form (12.10)
and an observation model of the form (12.8). Given these models, it is easy to write down an
EKF which tracks the robot state. The measurement update is just a standard Kalman filter
update (see, e.g., [111]), since the observation model is linear. For the motion update, we need a
Taylor approximation of the expected state at time $t + 1$ around the current MAP state $\hat{s}_t$, given
the current action $a_t$:

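$s_{t+1} - \epsilon_t \approx N\left[\phi(\hat{s}_t, a_t) + \frac{d\phi}{ds}\Big|_{\hat{s}_t} (s_t - \hat{s}_t)\right]$  (12.14)

$\frac{d\phi}{ds}\Big|_{\hat{s}} = (\hat{s} \otimes I + I \otimes \hat{s}) \otimes \bar{a}_t \otimes \bar{a}_t$  (12.15)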

We  simply  plug  this  Taylor  approximation  into  the  standard  Kalman  filter  motion  update  (e.g.,  [111]). 
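
The Jacobian of Eq. 12.15 and the resulting motion update can be sketched as follows. This is an illustrative sketch under stated assumptions: the state-noise covariance `Q`, the toy dimensions, the random `N`, and the function names are all hypothetical, and `phi` here takes the augmented action $\bar{a}_t$ directly. The analytic Jacobian is compared against central finite differences.

```python
import numpy as np

def phi(s, a_bar):
    """Feature map s (x) s (x) a_bar (x) a_bar, with a_bar already augmented by 1."""
    return np.kron(np.kron(s, s), np.kron(a_bar, a_bar))

def phi_jacobian(s_hat, a_bar):
    """d phi / d s at s_hat: (s_hat (x) I + I (x) s_hat) (x) a_bar (x) a_bar."""
    k = s_hat.size
    col = s_hat.reshape(k, 1)
    d_ss = np.kron(col, np.eye(k)) + np.kron(np.eye(k), col)     # k^2 x k
    return np.kron(d_ss, np.kron(a_bar, a_bar).reshape(-1, 1))   # (k^2 m^2) x k

def ekf_motion_update(N, s_hat, P, a, Q):
    """Propagate mean and covariance through s' = N phi(s, a) + noise (Q assumed known)."""
    a_bar = np.concatenate(([1.0], a))
    F = N @ phi_jacobian(s_hat, a_bar)        # linearized transition matrix
    return N @ phi(s_hat, a_bar), F @ P @ F.T + Q

rng = np.random.default_rng(0)
k, l = 3, 2                                   # toy sizes (assumed)
s_hat, a = rng.normal(size=k), rng.normal(size=l)
a_bar = np.concatenate(([1.0], a))

# Central finite differences, one column per state coordinate.
eps = 1e-6
numeric = np.column_stack([
    (phi(s_hat + eps * e, a_bar) - phi(s_hat - eps * e, a_bar)) / (2 * eps)
    for e in np.eye(k)
])
jacobian_gap = np.abs(numeric - phi_jacobian(s_hat, a_bar)).max()

N = rng.normal(size=(k, k * k * (l + 1) ** 2))
mean_next, P_next = ekf_motion_update(N, s_hat, 0.1 * np.eye(k), a, 0.01 * np.eye(k))
```

Since $\phi$ is quadratic in the state, the central difference agrees with the analytic Jacobian up to rounding error, which makes this a convenient unit test for an implementation.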
Algorithm 2 Robot System Identification

In: $T$ i.i.d. pairs of observations $\{o_t, a_t\}_{t=1}^{T}$, measurement model for 4 landmarks $C_{1:4}$ (by e.g.
GPS)

Out: measurement model $C$, motion model $N$, robot states $X$ (the $t$th column is state $s_t$)

1: Collect observations and odometry into a matrix $Y$ (Eq. 10.7)

2: Find the top 7 singular values and vectors: $(U, \Lambda, V^\top) \leftarrow \mathrm{SVD}(Y, 7)$

3: Find the transformed measurement matrix $CS^{-1} = U$ and robot states $SX = \Lambda V^\top$

4: Compute a matrix $\Phi$ with columns $\Phi_t = \phi(S s_t, a_t)$

5: Compute dynamics: $S N \bar{S}^{-1} = S X_{2:T} (\Phi_{1:T-1})^{\dagger}$

6: Compute the partial $S^{-1}$: $\hat{S}^{-1} = C_{1:4}^{-1} (C_{1:4} S^{-1})$, where $CS^{-1}$ comes from step 3. $\hat{S}^{-1} S X$
gives us the $x, y$ coordinates of the states. These can be used to find $X$ (see Section 10.3.2)

7: Given $X$, we can compute the full $S$ as $S = (SX) X^{\dagger}$

8: Finally, from steps 3, 5, and 7, we find the interpretable measurement model $(CS^{-1}) S$ and
motion model $N = S^{-1} (S N \bar{S}^{-1}) \bar{S}$.
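
The linear-algebra core of steps 2, 3 and 7 can be sketched on synthetic data. The sketch below makes several assumptions that are not part of the algorithm: the data matrix is noise-free and exactly rank 7, the measurement model `C` and states `X` are random stand-ins, and the true `X` is used directly in step 7, whereas the algorithm recovers it in step 6 from the landmark coordinates. The helper name `top_svd` is hypothetical.

```python
import numpy as np

def top_svd(Y, rank):
    """Rank-truncated SVD: returns U, Lambda (as a diagonal matrix), and V^T."""
    U, sing, Vt = np.linalg.svd(Y, full_matrices=False)
    return U[:, :rank], np.diag(sing[:rank]), Vt[:rank]

rng = np.random.default_rng(1)
C = rng.normal(size=(10, 7))             # stand-in measurement model (hypothetical)
X = rng.normal(size=(7, 50))             # stand-in states, one column per time step
Y = C @ X                                # noise-free data matrix of rank 7

U, Lam, Vt = top_svd(Y, 7)               # step 2
SX = Lam @ Vt                            # step 3: transformed states, with U = C S^{-1}
S = U.T @ C                              # the transform implied by this factorization
S_from_states = SX @ np.linalg.pinv(X)   # step 7: S = (S X) X^+
```

In this noise-free setting `U @ SX` reproduces `Y`, `C @ inv(S)` equals `U`, and step 7 recovers the same `S`, which confirms that the SVD factors differ from the interpretable ones only by an invertible linear transform.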